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Nonlinear  systems  can  exhibit  a wide  spectrum  of  dynamical  behaviors,  which 
include  fascinating  chaotic  dynamics.  A chaotic  system  is  deterministic,  but  the  system 
output  is  not  predictable  in  the  long  run.  Conventional  signal  modeling  techniques  assume 
that  these  signals  were  generated  by  some  linear  system  driven  by  random  noise,  but  they 
are  not  appropriate  to  model  the  underlying  nonlinear  dynamics. 

In  this  research,  a methodology  to  model  the  nonlinear  deterministic  dynamics 
from  a signal  is  established.  This  approach  is  called  dynamic  modeling.  Due  to  their  plas- 
ticity and  powerful  function  approximation  capability,  in  this  work  artificial  neural  net- 
works (ANNs)  are  chosen  as  the  nonlinear  modeling  tool. 

In  dynamic  modeling,  there  are  three  main  tasks:  dynamical  reconstrucd on,  noise 
reduction,  and  model  training.  To  reconstruct  the  underlying  dynamics  from  a signal,  a 
new  method  based  on  Poisson  Moment  Functionals  is  proposed  which  can  be  integrated  as 
part  of  data  acquisition.  A filtering  process  is  elegantly  embedded  in  this  method.  There- 
fore, the  noise  effect  can  be  reduced  when  we  measure  the  properties  of  the  reconstructed 
dynamics,  i.e.  dynamical  invariants.  Noise  reduction  can  be  considered  as  a preprocessing 
stage  for  dynamic  modeling.  Because  chaotic  signals  have  noise-like  broadband  spectrum, 
a linear  filter  may  not  remove  noise  effectively  and  may  distort  the  dynamical  invariants. 
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Projecting  a reconstructed  state-space  trajectory  onto  the  signal  subspace  to  remove  high- 
dimensional noise  can  be  regarded  as  a state-space  filtering  process.  In  this  research,  we 
further  propose  to  train  a neural  network,  equipped  with  a special  memory  structure,  as  a 
state  predictor  and  further  reduce  noise  in  the  signal  subspace. 

The  task  of  model  training  includes  the  adaptation  scheme,  the  selection  of  training 
parameters,  and  the  model  validation.  No  systematic  studies  on  this  topic  are  known  in  the 
literature.  Here,  we  propose  trajectory  learning  as  a more  accurate  tool  for  dynamical 
modeling  with  neural  networks.  Through  theoretical  analysis  and  experimental  verifica- 
tion, the  superiority  of  the  trajectory  learning  over  the  fixed-point  learning  in  dynamic 
modeling  can  be  justified.  For  chaotic  signals,  the  length  of  the  training  sequences  has 
been  found  crucial  to  avoid  oscillations  during  training.  We  propose  several  criteria  for 
this  selection  according  to  the  dynamical  properties  of  the  signals.  For  model  validation, 
we  propose  to  synthesize  a time  series  from  the  autonomous  ANN  dynamical  model,  and 
measure  its  dynamical  invariants.  The  original  and  synthesized  signals  should  have  similar 
dynamical  invariants  for  a successful  dynamic  modeling. 

The  proposed  modeling  procedure  has  been  applied  to  several  chaotic  and  noncha- 
otic  signals,  which  are  all  generated  from  equations.  Therefore,  the  performance  of  the 
modeling  can  be  verified.  The  experimental  results  show  that  the  proposed  methodology 
can  actually  yield  satisfactory  dynamic  models  for  these  test  signals. 
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CHAPTER  1 
INTRODUCTION 


1.1  Motivations 

Since  the  work  of  Yule  in  1927  to  predict  the  sunspot  cycle  with  a linear  model 
[Yul27],  the  effort  of  modeling  time  series  with  equations  of  motion  continued  in  a wide 
variety  of  research  areas.  This  modeling  procedure  is  usually  the  first  step  towards  under- 
standing and,  in  some  cases,  controlling  the  system  under  analysis.  Each  set  of  equations 
of  motion  represents  a dynamical  system.  Therefore,  this  modeling  approach  is  a process 
to  extract  the  underlying  dynamics  from  a given  signal. 

In  conventional  signal  modeling,  we  usually  assume  that  a complex  signal  is  the 
output  of  a linear  system  driven  by  random  noise.  In  other  words,  signals  are  treated  as 
realizations  of  some  random  processes,  and  the  underlying  systems  are  modeled  as  a linear 
system  [Mar87][Pap91].  After  the  discovery  of  chaos,  we  now  know  of  some  determinis- 
tic systems  with  few  degrees  of  freedom  which,  without  any  random  input,  can  also  pro- 
duce signals  that  exhibit  uncertainty  and  possess  noise-like  spectra.  These  systems  are 
named  chaotic  dynamical  systems  [Eck85].  They  can  be  found  almost  everywhere,  e.g. 
weather  patterns,  fluid  flows,  astronomical  motions,  laser  beams,  biological  systems,  and 
electric  circuits,  just  to  name  a few  [Lor63][Gal91][Chu88].  A chaotic  system  is  a nonlin- 
ear dynamical  system,  and  the  uncertainty  existing  in  its  output  is  originated  from  the  sys- 
tem dynamics  instead  of  an  external  driving  force.  Therefore,  it  is  inappropriate  to  apply 
the  conventional  technique  to  model  the  underlying  dynamics  of  a chaotic  signal. 

Signals  are  usually  contaminated  by  noise.  A chaotic  signal  has  a noise-like  broad- 
band spectrum.  Thus,  it  is  extremely  difficult  and  usually  not  effective  to  use  a frequency 
filter  to  remove  noise  from  a chaotic  time  series.  In  addition,  although  chaotic  signals  can 
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be  described  in  statistical  manner,  their  underlying  systems  are  intrinsically  deterministic. 
Therefore,  we  find  it  is  also  inappropriate  to  design  an  optimal  filter,  like  Kalman  filters 
[Boz79],  to  take  out  the  noise  from  the  signals. 

'^us,  the  need  of  a new  modeling  technique  and  a noise  reduction  procedure  for 
chaotic  time  series  is  the  motivation  behind  this  research. 

1 .2  Framework  and  Objectives 

In  experimental  dynamical  systems,  a procedure  to  reconstruct  a state-space  trajec- 
tory from  the  output  of  a system  was  introduced  by  Packard  et  al.  [Pac80].  Later,  Takens 
proved  that  the  reconstruction  preserves  two  important  qualitative  descriptors  of  the  origi- 
nal system  dynamics;  the  dimension  and  the  Lyapunov  exponents  of  the  system  attractor 
[TakSl].  The  dimension  of  a system  attractor  represents  the  number  of  irreducible  degrees 
of  freedom;  while  the  positive  (negative)  Lyapunov  exponents,  which  are  the  generaliza- 
tion of  eigenvalues  in  linear  systems,  indicate  the  growth  (destruction)  rate  of  the  informa- 
tion in  the  system  flow.  These  two  measurements  are  called  dynamical  invariants  because 
they  will  not  change  when  we  apply  any  diffeomorphic  transformation  to  the  system 
dynamics.  This  reconstruction  establishes  a bridge  between  a time-domain  signal  and  a 
state-space  representation  of  the  original  system.  In  this  modeling  approach,  the  resulting 
model  is  required  not  only  to  preserve  the  spectrum  of  a given  signal  (as  in  spectral  analy- 
sis) but  also  to  be  capable  of  producing  a time  series  which  possesses  the  same  dynamical 
invariants  as  the  original  signal.  In  other  words,  the  resulting  model  is  required  to  capture 
the  underlying  dynamics  of  the  signal.  Thus,  this  modeling  procedure  will  be  called 
dynamic  modeling. 

Although  chaotic  time  series  are  of  the  main  interest  in  this  research,  I also  extend 
this  work  to  non-chaotic  signals.  In  a more  general  perspective,  the  procedure  of  the 
dynamic  modeling  can  be  applied  to  any  signal  originated  from  deterministic  systems. 
However,  as  with  the  other  modeling  techniques,  we  need  to  make  some  assumptions 
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about  signals  and  their  underlying  systems.  Since  we  try  to  model  the  dynamics  of  the  sys- 
tem that  produced  the  signal,  the  noise  level  in  the  signal  is  assumed  to  be  small.  Other- 
wise, the  proposed  modeling  procedure  may  result  in  a model  combining  the  dynamics  of 
the  signal  with  the  noise.  The  other  assumption  is  that  the  underlying  system  is  autono- 
mous. In  other  words,  the  rules  that  govern  the  underlying  dynamics  will  not  change  with 
time.  This  is  similar  to  the  assumption  of  stationarity  in  conventional  signal  modeling. 
Both  assumptions  will  ensure  that  a set  of  time-invariant  coefficients  of  the  model  can  rep- 
resent the  underlying  system. 

The  output  of  a high-dimensional  chaotic  system  with  large  Lyapunov  exponents 
will  behave  very  similarly  to  random  noise.  It  will  become  extremely  difficult  to  model 
the  underlying  dynamics  from  the  output  signal.  Therefore,  the  proposed  modeling  tech- 
niques are  mainly  for  signals  whose  underlying  systems  have  low  dimensionality  (less 
than  4)  and  small  positive  Lyapunov  exponents  (less  than  2.5  bits/sec). 

Since  the  underlying  system  of  a chaotic  time  series  is  nonlinear,  a nonlinear 
model  can  be  expected  to  perform  better  than  a linear  model.  Artificial  neural  networks 
(ANNs)  are  a powerful  nonlinear  modeling  tool.  The  function  approximation  capability  of 
the  ANNs  has  been  shown  superior  to  the  conventional  polynomial  approximation 
[Lap87]  [Wei90]  [Mea91].  In  this  work,  I use  a special  class  of  neural  networks,  which  is 
named  dynamic  neural  network,  to  establish  a global  model  of  the  reconstruction  dynam- 
ics of  a given  signal. 

The  reconstruction  procedure  also  enables  us  to  separate  noise  from  a signal  based 
upon  the  differences  in  their  dynamical  properties.  This  approach  can  be  regarded  as  a 
state-space  filtering  technique.  For  a signal  with  broad-band  spectrum,  the  result  of  this 
filtering  process  will  not  severely  distort  the  frequency  components  in  some  specific 
bands.  In  this  work,  I train  a dynamic  network  with  special  type  of  memory  device  as  a 
trajectory  predictor  to  further  reduce  noise. 
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As  to  the  tasks,  this  research  can  be  divided  into  three  parts:  trajectory  reconstruc- 
tion,  noise  reduction,  and  dynamic  modeling.  This  is  shown  in  Figure  1-1.  The  noise  level 
of  the  signal  used  in  the  dynamic  modeling  is  required  to  be  small.  Thus,  the  noise  reduc- 
tion procedure  can  be  regarded  as  a pre-processing  for  the  dynamic  modeling  block.  For 
the  validation  of  a dynamic  model,  we  propose  to  reconstruct  a trajectory  from  the  output 
of  the  model  and  compare  the  dimension  and  the  Lyapunov  exponents  of  the  trajectory 
with  those  obtained  from  the  original  signal.  Besides  the  relationship  among  these  three 
tasks,  I also  list  my  contributions  in  the  diagram. 


Figure  1-1.  Partition  of  the  tasks  in  this  research. 


At  this  time,  the  modeling  of  chaotic  signals  is  still  in  its  infancy.  Most  results 
reported  in  the  literature  are  about  the  prediction  of  the  signals.  Although  in  Chapter  2 I 

show  that  a dynamic  model  can  also  be  formulated  as  a predictive  model,  the  goal  and  the 

1 

approach  to  establish  a dynamic  model  of  a signal  should  be  different  from  those  to  estab- 
lish a predictive  model  of  the  signal.  In  this  work,  I propose  to  impose  constraints  on  the 
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iterative  map  of  a dynamic  model  in  modeling  and  compute  the  dynamical  invariants  of 
the  model  in  model  validation.  These  two  propositions  distinguish  my  work  from  most  of 
the  others’. 


1.3  Literature  Review 

Since  most  of  the  work  that  was  done  in  modeling  chaotic  time  series  is  just  a non- 
linear extension  of  the  conventional  signal  modeling,  let  us  quickly  review  this  technique 
first.  In  conventional  signal  modeling,  a parameterized  description  of  the  underlying  sys- 
tem can  be  obtained  by  establishing  a predictive  model  of  the  system.  In  the  discrete-time 
case,  this  can  be  formulated  as 

=/(4>(O,0)  , eq.  1.3-1 

<1>^(0  ==  1)  ...y{t-p)  u(t)  II  (t-])  ...  u(t-q)\ 

where  y (r|  0)  is  the  prediction  of  the  signal  at  time  t based  upon  the  coefficient  vector  0, 
y{t)  is  the  data  sample  of  the  signal,  and  ii  (/)  is  the  data  sample  of  the  input  to  the 
model,  which  is  usually  assumed  to  be  random  noise.  A special  case  is  obtained  when  the 
prediction  output  is  a linear  function  of  the  0 and  the  (j)  (/) , or 

j)(f|0)  = 0^(1)(/)  eq.  1.3-2 

where  T is  the  transpose  operator.  This  is  the  well  known  Autoregressive-  Moving  Aver- 
age (ARMA)  model  [Box70][Kay81].  If  q in  eq.  1.3-1  is  equal  to  zero,  then  eq.  1.3-2 
becomes  the  Autoregressive  (AR)  model,  which  is  one  of  the  most  popular  models  in  sig- 
nal modeling.  No  matter  which  model  we  use,  the  coefficient  vector  is  generally  deter- 
mined by  fitting  the  data  samples  of  the  signal  with  the  selected  model.  According  to  some 
designed  cost  function,  a set  of  optimal  (or  sometimes  suboplimal)  coefficients  can  be  thus 
determined.  In  the  ARMA  modeling,  there  are  two  adaptive  scliemes  that  have  been  used 
in  deciding  coefficients.  One  is  called  equation  error  scheme,  and  the  other  is  output  error 
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scheme  [Shy89],  Let  us  rewrite  eq.  1.3-2  in  a more  convenient  form  using  delay-operator 
notation  as  follows: 

B {!,<!)  = 2 

/ = 0 

• ...  *^(/) 

where  y{t)z~'  =y{t- i)  . Because  during  adaptation  the  coefficients  will  change  with 
time,  the  vector  0 becomes  time-dependent.  In  fact,  this  is  the  formulation  for  the  equa- 
tion output  scheme.  A schematic  diagram  of  this  formulation  is  given  in  Figure  1-2. 


A{t,p)  = 

i=  1 


-/ 


and  0^(/)  = 


a,  (0 


Figure  1-2.  Schematic  diagram  of  the  equation  output  formulation. 


In  this  scheme,  instead  of  feeding  back  the  prediction  output  of  the  model  we  clock  in  the 
true  data  samples  of  the  signal.  Although  using  this  scheme  we  can  avoid  the  local  minima 
and  the  divergence  problem  peculiar  to  systems  with  feedback  loops,  the  result  may  be 
biased  away  from  the  optimal  solution  [Shy89].  This  is  because  after  we  obtain  a set  of  the 
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optimal  coefficients,  {a.}  and  {b.} , we  assemble  tlie  model  as  shown  in  the  shaded  area 
of  the  diagram  in  Figure  1-2.  In  other  words,  the  scheme  that  we  use  to  obtain  the  coeffi- 
cients is  different  from  the  scheme  we  operate  the  model. 

In  the  output  error  formulation,  we  modify  eq.  1.3-3  as 

j)(/|0(O)  =A(t,p)y(t\QiO)+B((,cj)uii)  eq.  1.3-4 

The  schematic  diagram  corresponding  to  this  formulation  is  shown  in  Figure  1-3.  In  this 
scheme,  we  build  the  model  in  the  same  way  as  we  use  it.  However,  since  the  model  out- 
put becomes  a nonlinear  function  of  the  feedback  coefficients,  there  exists  local  minima  in 
the  performance  curve.  In  addition,  the  control  of  the  stability  during  adaptation  will 
become  very  difficult  if  p is  larger  than  2. 


Figure  1-3.  Schematic  diagram  of  the  output  error  formulation. 


In  the  discrete-time  case,  a dynamical  system  can  be  expressed  as 

^(^+1)  =f(x(0)  eq.  1.3-5 

where  x (t)  is  the  system  state  at  time  t,  and/is  called  the  vector  field  of  the  system.  For  a 
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given  initial  condition,  the  iterates  of  the  mapping  / will  yield  a state-space  trajectory.  We 
note  that  in  this  formulation  the  mapping  / can  also  be  viewed  as  a predictive  model  of  the 
system  states.  Assume  we  already  select  a model  for  the  mapping  /.  From  a given  signal, 
we  can  reconstruct  a trajectory  of  the  system,  and  the  reconstructed  state  vectors  can  be 
used  in  computing  the  coefficients  of  the  model.  Most  researchers  established  the  model  in 
a way  similar  to  the  equation  output  scheme  in  the  ARMA  modeling  [Cru87] 
[Cas91][Lap87][Mea91][Wei90].  In  other  words,  they  feed  in  the  current  state  vector 
(instead  of  the  previous  prediction  output  of  the  model)  and  adjust  the  parameters  of  the 
model  such  that  the  output  of  the  model  becomes  close  to  the  next  state.  In  this  manner,  no 
direct  constraints  are  imposed  on  the  iterative  map  of  the  model  when  optimizing  the 
parameters.  In  dynamic  modeling,  the  iterates  of  the  model  are  required  to  preserve  the 
dynamical  invariants  of  the  original  system.  Therefore,  the  resulting  model  of  this  scheme 
may  be  a good  one-step  state  predictor  but  a poor  dynamic  model.  To  the  best  of  my 
knowledge,  besides  my  work  there  is  only  one  other  research  group  that  designates  a cost 
function  to  include  the  constraint  on  the  iterative  map  of  the  model 
[Pri92b][Kuo92][Eis91].  The  cost  function  proposed  by  Eisenhammer  et  al.  is  as  follows 
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eq.  1.3-6 


where  ||  ||  is  the  Euclidean  norm,  are  the  model  states,  are  the  reconstruction  states, 
t.  are  the  / times  when  the  initial  states  are  taken  and  A/,  are  the  / times  after  which 

j 'max  I max 

reconstruction  and  model  states  are  compared.  However,  the  dynamics  that  Eisenhammer 
et  al.  tried  to  model  are  restricted  to  a limit  cycle  from  human  limb  movement,  and  there  is 
no  criterion  proposed  to  decide,  / , which  is  the  number  of  iterates  of  the  model. 


The  approach  that  we  just  mentioned  is  a global  modeling  technique.  There  is  only 
one  global  model  that  represents  the  entire  reconstruction  dynamics  of  a signal.  Another 
approach  is  to  partition  a state  space  into  small  regions  and  establish  a local  model  to 
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describe  the  dynamics  in  each  region.  Usually,  a linear  model  is  chosen  in  this  approach.  A 
linear  approximation  of  eq.  1.3-5  is 

X (/ + 1)  « Jx(r)  + 6 eq.  1.3-7 

A 

where  J is  the  Jacobian  of/  at  x (t) , and  b is  a constant  vector.  This  model  and  its  varia- 
tions were  adopted  by  several  research  groups  for  different  applications  [Far87][Eckm86] 
[Kos90][San85].  This  local  linearization  model  is  only  valid  in  a small  neighborhood. 
Therefore,  a large  number  of  state-space  partitions  may  be  required.  Consequently,  a large 
amount  of  data  samples  is  needed  in  establishing  the  local  models  for  all  of  the  regions. 
This  approach  usually  can  yield  very  precise  short-term  prediction.  However,  a big  draw- 
back of  using  local  linear  models  was  reported  by  Crutchfield  and  McNamara[Cru87].  In 
their  investigations  of  piece-wise  linear  models,  they  found  that  these  models  are  unreli- 
able indicators  of  the  underlying  deterministic  behavior.  For  instance,  when  we  compose 
these  piece-wise  linear  models  to  describe  the  global  behavior  we  may  obtain  a periodic 
dynamics  while  the  underlying  system  is  chaotic.  Therefore,  in  dynamic  modeling  I 
choose  the  global  modeling  approach. 

There  are  several  global  models  that  had  been  proposed  by  different  research 
groups.  Lapedes  and  Farber  trained  a feedforward  artificial  neural  network  as  a one-step 
predictor  of  a chaotic  time  series  [Lap87].  They  showed  that  the  resulting  neural  network, 
by  iteration,  can  produce  multi-step  predictions  which  are  much  more  accurate  than  the 
outputs  of  a polynomial  or  a Widrow-Hoff  (adaptive  FIR)  model.  They  also  reported  that 
the  iterates  of  a polynomial  model  will  blow  up  very  fast.  Basically,  Weigend  et  al.  fol- 
lowed the  same  approach  as  Lapedes  and  Farber,  but  a modified  cost  function  is  designed 
to  reduce  those  insignificant  parameters  [Wei90]. 

In  predicting  chaotic  time  series,  Casdagli  proposed  to  approximate  the  partial 
mapping  of /,  denoted  by  ti/ (This  partial  mapping  will  be  shown  to  be  a predictive  model 
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of  the  signal  in  Chapter  2),  as 


N 

n = \ 


a 


Jk 

x-x 


n 


) + Z 

I)  = 1 


eq.  1.3-8 


Jk 

where  x is  the  state  vector,  p is  a radial  basis  function,  is  the  center  of  a radial  function, 
{p  : n=l,2,...,^}  is  a set  of  polynomial  bases  of  degree  at  most  d,  and  and  are  the 
parameters  which  will  be  determined  by  using  least  squares.  Since  there  is  no  constraint 
imposed  on  the  iterative  map  of  nf,  the  temporal  index  of  the  state  vectors  can  be  dropped. 
The  number  of  the  radial  basis  functions,  N,  is  equal  to  the  number  of  the  reconstructed 
state  vectors,  each  of  which  is  the  center  of  one  of  the  basis  functions.  Casdagli  chose  a 
cubic  radial  function,  i.e.  p (/*)  = r , and  a linear  polynomial  (d=l)  as  the  approximation 
model  [Cas89].  Although  this  is  a global  approximation,  the  model  can  still  preserve  very 
well  local  properties  because  of  the  use  of  the  radial  basis  functions.  However,  the  disad- 
vantage is  that  the  computation  of  the  parameters  will  become  impractical  when  the  num- 
ber of  data  points  is  large. 

Moody  and  Darken  proposed  a neural  network  implementation  of  the  radial  basis 
function  approximation  [Moo89].  The  output  of  the  network  is  computed  by 

k 


<^(x) 


eq.  1.3-9 


Pj  (i) 


A 

where  x is  the  state  vector  (the  temporal  index  is  dropped),  x - and  P . are  the  center  and  the 

J J 

width  of  a gaussian-shape  radial  function,  and  w.  are  the  coefficients  or  connection 
weights  of  the  network.  The  number  and  the  locations  of  the  centers  of  the  basis  functions 
can  be  determined  from  the  distribution  of  the  state  vectors.  Mead  et  al.  modified  this  net- 
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work  by  including  an  extra  term  [Mea91],  The  output  of  the  network  becomes 


= Z [“"y 


4. 


Py  <^> 

Zp.w 


eq.  1.3-10 


where  w.  and  dj  are  adjustable  parameters  associated  with  the  j-th  basis  function.  The 

A A -A 

incorporation  of  the  linear  term  (x  - x ) • d-  relaxes  the  constraint  which  is  imposed  on  the 
adaptation  in  interpolating  the  point  x -.  Consequently,  the  number  of  the  basis  functions 
can  be  reduced  without  degrading  the  performance.  Mead  et  al.  trained  this  network  to 
predict  the  same  chaotic  time  series  as  used  in  the  work  of  Lapedes  and  Farber.  The  main 
drawback  is  that  the  training  of  this  network  may  become  unstable  occasionally. 

As  to  the  state-space  noise  reduction,  there  are  mainly  two  approaches.  One 
approach  is  to  use  the  difference  in  the  dimensionality  of  a deterministic  attractor  and  of 
random  noise  [Lan91].  Through  a projection  operation,  high-dimensional  noise  can  be 
removed.  The  other  approach  is  to  smooth  the  jaggedness  of  the  reconstructed  trajectory 
of  a signal,  which  is  caused  mainly  by  the  noise  [Kos90][Caw92].  Consequently,  the  noise 
components  can  be  reduced.  1 have  proposed  a method  combining  both  approaches  to 
reduce  noise  level  further  in  state  space  [Pri93b]. 

1 .4  Organization  of  Chapters 

In  Chapter  2,  after  a brief  review  of  nonlinear  dynamics  I will  discuss  the  proce- 
dures to  reconstruct  a state-space  trajectory  from  the  output  of  a dynamical  system.  The 
underlying  dynamics  of  a signal  can  thus  be  studied  and  quantified.  Then,  I will  discuss  a 
state-space  noise  reduction  technique  based  on  this  reconstruction.  At  the  end  of  the  chap- 
ter, the  existence  of  a dynamic  model  of  a deterministic  signal  is  justified.  And,  several 
methods  are  proposed  to  determine  the  design  parameters  in  dynamic  modeling  and  to  val- 
idate the  model. 


12 


In  Chapter  3, 1 first  examine  artificial  neural  networks,  the  nonlinear  modeling  tool 
that  I choose  for  this  research,  from  three  different  aspects:  activation  functions,  connec- 
tion structures,  and  learning  algorithms.  Different  memory  mechanisms  embedded  in  net- 
work structures  will  be  discussed  and  analyzed.  For  dynamic  modeling,  a global  feedback 
structure  is  proposed.  And,  to  train  a network  of  this  structure,  I introduce  and  compare 
several  trajectory  learning  algorithms,  including  the  one  that  I propose. 

I will  report  all  of  my  experimental  results  about  dynamic  modeling  in  Chapter  4. 
And,  Chapter  5 is  for  the  conclusions  and  the  future  directions  of  this  research. 


CHAPTER  2 

DYNAMICS  OF  SIGNALS 


For  a stable  linear  time-invariant  (LTI)  system,  all  of  the  state  trajectories  will 
relax  to  the  same  point  in  state  space.  Therefore,  the  study  of  the  steady  state  of  an  LTI 
system  is  trivial.  On  the  other  hand,  nonlinear  autonomous  systems  can  exhibit  a wide 
variety  of  dynamical  behaviors,  e.g.  limit  cycles,  tori,  and  even  fractals.  This  richness  in 
dynamical  behaviors  makes  the  modeling  of  nonlinear  dynamics  very  challenging. 

Given  the  state  equations  of  an  LTI  system,  we  usually  can  obtain  a closed-form 
solution  which  describes  the  evolution  of  the  system  states.  On  the  other  hand,  a nonlinear 
system  usually  can  not  be  solved  analytically.  A common  approach  is  to  use  a numerical 
method  to  integrate  the  system  equations  from  different  initial  conditions.  In  the  state 
space,  these  numerical  solutions  represent  the  state  trajectories  of  the  system.  The  dynam- 
ics of  the  system  are  then  studied  based  on  these  trajectories.  In  the  first  section  of  this 
chapter,  I will  give  a brief  review  of  nonlinear  dynamical  systems  and  introduce  two 
important  descriptors  that  have  been  used  to  quantify  a system  dynamics.  These  descrip- 
tors are  the  dimension  and  the  Lyapunov  exponents  of  the  system  attractors.  Since  these 
measurements  will  not  change  when  we  apply  any  diflfeomorphic  transformation  to  the 

system  dynamics,  they  are  referred  to  as  dynamical  invariants. 

In  most  real-life  problems,  equations  of  motion  are  unknown.  What  we  usually 
have  is  just  a sampled  output  signal  of  a dynamical  system.  Therefore,  the  first  step  to 
study  the  system  dynamics  is  to  reconstruct  a trajectory  from  the  signal.  In  section  2.2, 1 
will  discuss  two  reconstruction  methods  proposed  by  Packard  et  al.  [Pac80].  The  recon- 
struction using  these  methods  was  proved  to  be  able  to  preserve  the  measurements  of  the 
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dynamical  invariants.  In  this  section,  I also  propose  a reconstruction  method,  which  can  be 
implemented  as  part  of  data  acquisition.  I prove  that  this  reconstruction  procedure  will 
yield  a filtered  version  of  the  trajectory  reconstructed  from  one  of  the  methods  proposed 
by  Packard  et  al.  Under  a certain  condition,  the  proposed  method  can  preserve  the  dynam- 
ical invariants.  I also  show  that  the  method  is  robust  for  noisy  signals. 

In  section  2.3,  I will  review  the  algorithms  used  in  computing  the  dynamical 
invariants  from  a reconstructed  trajectory.  The  determination  of  the  parameters  for  these 
algorithms  will  be  discussed.  I also  suggest  some  modifications  for  one  of  the  algorithms 
to  eliminate  a pathological  phenomenon. 

The  trajectory  reconstruction  allows  us  to  study  the  system  dynamics  from  the  sys- 
tem output.  An  important  application  of  this  reconstruction  is  that  we  can  apply  the  differ- 
ences in  dynamical  behavior  between  a deterministic  signal  and  a random  process  to 
reduce  noise.  This  approach  can  be  regarded  as  a procedure  to  remove  noise  in  state  space. 
The  first  part  of  the  section  2.4  is  the  discussion  of  the  state-space  noise  reduction.  In  the 
second  part  of  the  section,  I show  that  based  upon  this  reconstruction  we  can  also  justify 
the  existence  of  a dynamic  model  for  a given  signal. 

In  the  last  section,  I will  discuss  the  importance  of  the  dynamical  invariants  in 
dynamic  modeling.  They  can  be  used  in  determining  some  design  parameters  (e.g.  the 
optimization  criterion  and  the  order  of  the  model)  and  validating  the  resulting  model.  The 
approaches  proposed  in  this  section  will  be  applied  to  the  experiments  reported  in  Chapter 
4. 


2 1 Review  of  Nonlinear  Dynamics 

An  n-th  order  autonomous  dynamical  system  can  be  represented  by  a set  of  Ordi- 
nary Differential  Equations  (ODEs),  which  is  referred  to  as  the  system  state  equations: 


dx 

dt 


(0  =/(^(0) 


eq.  2.1-1 
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where  x (t)  are  the  system  states,  and /is  called  the  vector  field.  The  vector  field/ maps  a 
manifold,  or  a state  space,  M to  a tangent  space  T.  If  / is  a linear  function  of  the  system 
states,  the  system  is  linear;  otherwise,  the  system  is  nonlinear.  Assume  there  is  a closed- 

form  solution  for  eq.  2.1-1,  For  a given  initial  condition  Xq,  the  function 

(j)  (Xq)  represents  a state-space  trajectory  of  the  system.  Thus,  the  mapping  is  called 
the  system  flow. 

Let  us  assume  that  the  vector  field / is  a c‘  (continuously  differentiable)  function, 
which  is  not  a restrictive  condition  for  most  real-life  systems.  In  dynamical  systems,  this 

assumption  will  ensure  the  existence  of  the  system  flow  (j)^  and  its  inverse  (j)^  for  any 

finite  time  [Hir74].  Consequently,  the  trajectories  of  an  autonomous  system  have  the  fol- 
lowing property; 

(j)  (x,)  = (j)  (x-,)  if  and  only  if  X,  = X2 

f 1.  I ^ 

where  Xj  and  X2  are  two  system  states.  This  implies  that  the  trajectories  of  an  autonomous 
system  will  never  cross  one  another. 

For  a stable  system,  the  system  trajectories  will  fall  into  the  neighborhoods  of 
some  finite-dimensional  limit  sets  after  transients  die  out.  These  limits  sets  are  referred  to 
as  system  attractors.  In  dynamic  modeling,  we  are  interested  in  only  the  steady-state 
behavior  of  a system. Therefore,  the  dynamics  in  the  neighborhood  of  a system  attractor  is 
of  the  most  importance. 

For  an  LTI  system,  a closed-form  solution  of  the  system  flow  (|)^  is  usually  avail- 
able. Therefore,  the  system  dynamics  can  be  easily  analyzed.  On  the  other  hand,  a closed- 
form  solution  of  a nonlinear  system  is  usually  not  obtainable.  The  study  of  the  system 
dynamics  are  mainly  based  upon  the  numerical  solutions  of  the  system  state  equations.  In 
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steady  state,  these  state-space  trajectories  (numerical  solutions)  can  be  used  to  represent 
the  dynamics  of  the  system  attractors. 

To  quantify  nonlinear  dynamics,  two  important  descriptors  have  been  introduced: 
the  dimension  and  the  Lyapunov  exponents  of  the  system  attractor  [Eck85],  The  dimen- 
sion of  the  system  attractor  is  the  irreducible  number  of  degrees  of  freedom  of  the  system, 
and  the  positive  (negative)  Lyapunov  exponents  are  the  measurement  of  the  growth 
(destruction)  rate  of  information  within  the  system  flow. 

Nonlinear  dynamics  can  exhibit  a wide  spectrum  of  dynamical  behaviors.  An 
attractor  of  a nonlinear  system  can  be  a limit  cycle,  a torus,  or  even  a fractal.  A fractal  is  a 
geometric  object  whose  dimension  is  not  an  integer  [Man82].  Having  a fractal  attractor  is 
one  of  the  characteristics  that  we  use  to  identify  chaotic  dynamics.  The  conventional  defi- 
nition of  dimension  can  not  be  applied  to  describe  a fractal.  In  the  next  subsection,  a gen- 
eralized dimension  will  be  defined  to  include  the  fractal  case. 

For  an  LTI  system,  the  way  in  which  the  system  flow  approaches  the  point  attrac- 
tor can  be  completely  described  by  the  eigenvectors  and  the  eigenvalues  of  the  system. 
The  eigenvalues  indicate  the  speeds  at  which  a trajectory  moves  towards  the  attractor 
along  the  directions  of  the  eigenvectors.  On  the  other  hand,  in  a nonlinear  regime  there  is 
no  system  eigenvalues  and  eigenvectors.  However,  the  concept  of  the  eigenvalue  was  gen- 
eralized, and  the  Lyapunov  exponents  were  introduced  [Eck85].  In  the  subsection  2.1.2, 
we  will  discuss  the  Lyapunov  exponents  in  details. 

The  Lyapunov  exponents  have  also  been  used  to  identify  a chaotic  system.  Since 
information  (uncertainty)  will  be  created  in  a chaotic  dynamics,  a chaotic  system  will  pos- 
sess at  least  one  positive  Lyapunov  exponent.  Primarily  due  to  this  characteristic,  a chaotic 
system  becomes  very  different  from  the  other  dynamical  systems. 
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2.1.1  Generalized  Dimension 

The  conventional  definition  of  dimension  can  be  stated  as  follows: 

A geometric  configuration  is  of  dimension  n if  n is  the  least  number  of  real- 
valued parameters  which  can  be  used  to  (continuously)  determined  the  points 
of  the  configuration;  i.e.,  if  there  are  n degrees  of  freedom,  or  the  configura- 
tion is  (locally)  topologically  equivalent  to  a subspace  of  n-dimensional 
Euclidean  space  (pp.  123)  [Jam92]. 

Since  a fractal  does  not  resemble  any  Euclidean  space  or  its  subspace,  this  definition  can 
not  be  applied  when  we  try  to  determine  the  dimension  of  a fractal  attractor.  There  are  sev- 
eral ways  to  generalize  the  definition  of  dimension  to  include  the  fractal  case.  The  most 
frequently  adopted  definition  is  so  called  generalized  dimension  [Gra83a],  which  is  given 
by 


lim 

1 - ^s^o/a/  (1/e) 


V(e) 

In  ^ FJ  (e) 

I = 1 


eq.  2. 1.1-1 


where  N(s)  is  the  number  of  small  volume  elements  with  radius  s required  to  cover  the 
entire  attractor,  and  Pi(s)  is  the  probability  for  a point  of  the  attractor  to  fall  in  volume  ele- 
ment i.  Among  different  values  of  q,  the  most  frequently  used  ones  are  0,  1,  and  2.  In  the 
literature,  Dq,  Dj,  and  D2  are  also  called  capacity  dimension,  information  dimension  and 

correlation  dimension  respectively. 

The  capacity  dimension  is  defined  by  setting  q equal  to  0 in  eq.  2. 1.1-1, 


lim  D 

0^  ^ 


/f/N(e) 
/aa  (l/s) 


eq.  2.1. 1-2 


The  idea  of  this  definition  is  to  use  volume  elements  with  radius  s to  cover  an  attractor.  As 

8 is  made  smaller,  the  sum  of  the  volume  elements  approaches  the  volume  of  the  attractor. 
The  number  of  volume  elements  required  to  cover  the  whole  attractor  is  proportional  to 


18 


(1/e)  For  instance,  it  requires  l/e  volume  elements  to  cover  a line  segment  of  unit 
length.  Thus,  according  to  this  definition  the  dimension  of  a line  segment  is  one.  This 
result  is  consistent  with  that  of  the  conventional  definition. 

Capacity  dimension  is  also  useful  to  illustrate  the  concept  of  non-integer  dimen- 
sion. Take  the  example  of  the  Cantor  middle-thirds  set.  The  Cantor  set  can  be  constructed 
as  follows;  Take  a straight  line  that  fills  the  unit  interval,  and  remove  the  inner  third.  Both 
the  remaining  short  lines  are  treated  in  the  same  way,  resulting  together  in  four  short  inter- 
vals, each  of  length  1/9.  The  process  of  removing  thirds  is  repeated  infinitely  many  times; 
the  remaining  structure  will  converge  to  the  Cantor  set.  This  construction  procedure  is 
depicted  in  Figure  2-1.  The  Cantor  set  is  a fractal.  To  compute  the  capacity  dimension  of 

the  Cantor  set,  we  can  choose  a line  with  length  e = 3 ^ as  the  volume  element  for  j-th 


stage  construction.  Then  we  have  N(s)  - 2^.  The  capacity  dimension  of  the  Cantor  set  is 


calculated  by 


lnN(e)  Inl 

lim = — 

ln3 


0.6309... 


eq.  2. 1.1-3 


Since  the  dimension  of  the  Cantor  set  must  be  between  0 (point)  and  1 (line),  this  estimate 
is  reasonable  by  intuition. 


Figure  2-1.  Construction  of  the  Cantor  set. 
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The  definition  of  the  capacity  dimension  is  important  conceptually.  However,  in 
simulation  or  experiment,  the  attractor  is  not  seen  directly.  Typically,  we  have  only  state 
trajectories  over  finite  time  periods.  For  an  accurate  calculation  of  Dq,  the  system  must  run 

long  enough  to  guarantee  that  the  trajectories  have  visited  almost  every  part  of  the  attrac- 
tor. The  required  time  interval  and  amount  of  data  is  often  impractically  large  [Gre82]. 

The  information  dimension  and  the  correlation  dimension  are  defined  in  terms  of 
the  distribution  of  system  states.  Therefore,  based  on  either  one  of  two  definitions,  we  can 
determine  the  dimension  of  an  attractor  without  using  an  exhaustive  box-counting  algo- 
rithm as  required  for  the  computation  of  the  capacity  dimension.  To  obtain  the  formula  of 
the  information  dimension,  we  set  q equal  to  1 and  apply  L’Hospital’s  rule  to  eq.  2. 1.1-1, 


V(e) 

D,  = lim  D = lim  ‘ ~ ^ — 

' 9->i  ^ 8-»o  ln{\/e) 


eq.  2. 1.1-4 


We  note  that  the  numerator  represents  the  entropy  (tiverage  self  information)  of  the  distri- 
bution of  the  system  states.  Therefore,  the  dimension  is  referred  to  as  information  dimen- 


sion. The  correlation  dimension  (q=2)  is  defined  as 

N(e) 
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D. 
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eq.  2. 1.1-5 


eq.  2. 1.1-6 


where  C(e)  is  called  correlation  integral,  N is  the  nuniber  of  state  points,  H()  is  the  Heavi- 
side  step  function,  and  are  system  states.  According  to  the  definition,  the  correlation 
integral  is  proportional  to  the  average  number  of  pairs  of  state  points  within  a spheroid  of 
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radius  e.  Grassberger  showed  that  the  following  relationship  exists  among  D^,  D^,  and 


£>2  < £>i  < Z) 


0 


eq.  2.1. 1-7 


Both  correlation  dimension  and  information  dimension  are  defined  in  probabilistic  way. 
There  is  no  reason  to  prefer  one  over  the  other.  However,  the  algorithm  developed  to  com- 
pute the  correlation  dimension  is  simpler  than  that  to  compute  the  information  dimension. 
Therefore,  the  correlation  dimension  has  been  widely  accepted  in  computing  the  dimen- 
sion of  a fractal  attractor.  In  this  research,  I also  adopt  this  definition. 

2 1.2  Lyapunov  Exponents 

Lyapunov  exponents  are  the  measurements  of  the  expansion  or  the  contraction  rate 
of  the  system  flow  in  the  neighborhood  of  an  attractor  [Eck85].  They  are  a generalization 
of  the  eigenvalues  in  linear  systems.  The  Lyapunov  exponents  can  be  defined  as  follows. 
Consider  an  n-th  order  dynamical  system.  We  monitor  the  long-term  evolution  of  an  infin- 
itesimal n-sphere  with  radius  s(0)  at  time  t=0.  The  sphere  will  become  an  n-ellipse  due  to 
the  locally  deforming  nature  of  the  flow.  The  i-tli  one-dimensional  Lyapunov  exponent  is 
then  defined  in  terms  of  the  length  of  the  ellipsoidal  principal  axis  8^.  (/)  : 

X.  = lim -log,, eq.  2. 1.2-1 

' "8(0) 

Here,  we  use  a based  two  logarithm  to  define  the  Lyapunov  exponents  in  bits/sec. 
If  the  natural  logarithm  is  used,  the  unit  becomes  nats/sec.  The  sign  of  a Lyapunov  expo- 
nent represents  the  divergence  (plus)  or  the  convergence  (minus)  of  the  system  flow  in  the 
corresponding  direction.  Since  the  divergence  (convergence)  of  nearby  trajectories  can 
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also  be  interpreted  as  the  creation  (destruction)  of  the  information  (randomness),  the 
Lyapunov  exponents  can  be  regarded  as  the  measurement  of  the  growth  or  the  destruction 


rate  of  the  information  in  the  system  flow. 

A chaotic  system  has  at  least  one  positive  Lyapunov  exponent.  In  other  words,  the 
nearby  trajectories  in  the  neighborhood  of  a chaotic  attractor  will  diverge  at  least  in  one 
direction.  This  property  will  result  in  the  sensitivity  of  the  system  dynamics  to  the  initial 
condition.  This  phenomenon  can  be  illustrated  in  the  following  example.  The  Lorenz 
equations  are  a model  established  in  weather  forecasting  [Lor63].  They  are  given  by 


X = o(y-x) 
y = x(r-z)  -y 

z - xy-bz 


eq.  2.1.2  -2 


where  o,  r,  and  b are  constants.  With  a = 10,  r = 28,  and  b-8/3,  the  system  exhibits  a 

chaotic  dynamic.  Let  us  compute  the  solutions  from  two  initial  conditions  with  minor  dif- 
ference (0.01).  In  Figure  2-2,  I plot  the  signals  of  x coordinates  of  both  solutions.  The 
divergence  of  both  signals  can  be  clearly  observed.  Therefore,  unless  the  initial  condition 
can  be  specified  with  infinite  precision,  the  dynamical  behavior  of  a chaotic  system 
becomes  unpredictable. 

In  general,  the  number  of  fundamental  frequencies  that  the  system  can  generate 
can  be  identified  from  the  number  of  the  Lyapunov  exponents  that  equal  to  zero.  For 
instance,  a limit  cycle  contains  a fundamental  frequency  and  has  one  zero  exponent;  a 
torus  contains  two  fundamental  frequencies  and  has  two  zero  exponents.  However,  when  a 
system  contains  positive  Lyapunov  exponents  the  output  of  the  system  will  have  a noise- 
like wideband  spectrum.  As  an  example,  the  waveform  and  the  spectrum  of  the  Lorenz 
signal  used  in  the  previous  example  are  shown  in  Figure  2-3.  We  note  that  although  the 
signal  is  almost  noise-free  (except  very  small  quantization  errors)  it  has  a spectrum  very 
similar  to  noisy  signals. 
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The  Lyapunov  exponents  can  also  be  related  to  the  dimension  of  the  attractor.  In 
fact,  Kaplan  and  Yorke  defined  a fractal  dimension  according  to  the  Lyapunov  exponents 


[Kap79],  The  formula  is  given  by 


- J + 


A. , + + . . . + A/ . 

t 2 y 


eq.  2. 1.2-3 


where  the  Lyapunov  exponents  are  indexed  in  descending  order,  i.e.  X.  >X.^^,  and  j is  the 

largest  integer  for  which  Xj  + ^2  + ...  + A.  ^ 0.  Since  this  dimension  is  based  on  the 

Lyapunov  exponents,  it  is  called  Lyapunov  dimension.  The  Lyapunov  dimension  is  con- 
jectured to  be  equal  to  the  information  dimension  [Gra83c].  Although  there  is  still  no 
proof  of  this  conjecture,  numerical  simulations  do  not  appear  to  contradict  this  conjecture. 


Figure  2-2.  X-coordinate  signals  of  the  Lorenz  equations  integrated  from 
two  initial  conditions  with  minor  difference. 
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Figure  2-3.  (a)  Waveform  and  (b)  spectrum  of  the  Lorenz  signal. 


2.2  Reconstruction  of  Underlying  Dynamics 
In  this  research,  the  dynamics  reconstruction  from  a signal  plays  an  important  role. 
The  reconstruction  not  only  provides  a theoretical  justification  for  the  dynamic  modeling 
but  also  enables  us  to  measure  the  underlying  dynamics  that  produced  the  signal.  In  exper- 
imental dynamical  systems,  Packard  et  al.  suggested  that  an  observable  of  a dynamical 
system,  x(t),  and  either  its  delayed  versions  (i.e.  x(t+x),  x(t+2x),...)  or  its  derivatives  (i.e. 
x'(t),  x"(t),...)  can  be  used  as  the  coordinates  of  the  points  on  a trajectory  in  an  Euclidean 
space  [Pac80].  In  other  words,  a state  vector  on  this  trajectory  at  time  t is  given  by 


(0  = x(t)  x(/  + 1) 


x(/  + 2/wx)] 


eq.  2.2-1 


or 


jc  (/)  x'  (/) 


eq.  2.2-2 


where  x is  a constant  delay,  2m+l  is  the  dimension  of  the  construction  space,  and  x''  (t) 
is  the  i-th  time  derivative  of  the  signal.  Packard  et  al.  were  able  to  show  numerically  that 
this  reconstruction  can  preserve  the  dynamical  invariants. 
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To  prove  this  result  in  mathematics,  let  us  assume  a measurement  function 
h:M^R, 


h(s{t))  = X (/) 


eq.  2.2-3 


where  5 (t)  is  the  original  system  state,  and  M is  the  manifold  of  the  original  system 
states.  Suppose  we  use  vector  >'j  (t)  to  reconstruct  a trajectory.  We  can  rewrite  eq.  2.2-1 
as 


Ti(0 


eq.  2.2-4 


where  <j>  :M— is  the  original  system  flow.  Let  us  define  the  reconstruction  map 


as 


T(5(0)- 


eq.  2.2-5 


= (0 


Takens’  Embedding  Theorem:  If  the  dimension  of  the  reconstruction  space  satisfies  the 
condition  2m  + \>2d+\,  the  reconstruction  map  'F  is  generically  an  embedding 

[TakSIJ. 


An  embedding  is  a smooth,  one-to-one  coordinate  transformation  with  a smooth 
inverse.  Or,  equivalently  an  embedding  is  a diffeomorphism.  If  'P  is  an  embedding,  a 

smooth  dynamics,  F,;  7?“"'  k , can  be  induced  and  is  given  by: 


-1 


Cv,('o))  = CViC'o)) 


eq.  2.2-6 
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where  • is  the  composition  of  functions.  We  note  that  is  the  flow  of  the  recon- 
structed dynamics.  Since  the  map  'F  is  diffeomorphic,  the  flow  can  preserve  the 

dynamical  invariants.  The  relationship  among  these  mappings  and  vectors  are  illustrated 
in  Figure  2-4. 

Based  upon  a similar  derivation,  Takens  proved  that  the  reconstruction  using 
}>2  (()  is  also  an  embedding. 


2.2. 1 Delay  Coordinate  Reconstruction 

A A 

Although  either  (t)  or  y2  (0  can  be  used  in  the  reconstruction,  most  research- 

A 

ers  prefer  (/) . This  trend  is  mainly  due  to  the  fact  that  it  is  usually  computational- 
expensive  and  noise-sensitive  to  estimate  all  time  derivatives  of  a signal  as  required  to 
obtain  y^y  (/) . The  reconstruction  procedure  using  (/)  is  referred  to  as  delay  coordinate 


method. 
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For  an  infinite  amount  of  noise-free  data,  the  time  delay  x used  in  the  delay  method 
can  in  principle  be  chosen  almost  arbitrarily.  However,  experiments  show  that  the  quality 
of  the  reconstructed  attractor,  which  will  affect  significantly  the  measurement  of  the 
dynamical  invariants,  depends  on  the  value  chosen  for  x.  Using  a small  delay  the  recon- 
structed trajectory  may  appear  to  lie  on  a diagonal  line  due  to  the  highly  correlated  coordi- 
nates of  the  state  vectors.  On  the  other  hand,  using  a large  delay  may  totally  distort  the 
reconstruction.  Therefore,  usually  we  select  the  smallest  x that  can  make  x(t)  and  x(t+x)  as 
uncorrelated  as  possible.  An  intuitive  choice  is  the  delay  that  yields  the  first  zero-crossing 

point  of  the  autocorrelation  between  x(t)  and  x(t+x).  Yet,  another  criterion  proposed  by 
Fraser  and  Swinney  is  to  use  average  mutual  information  function  of  x(t)  and  x(t+x) 
[Fra86].  They  suggested  to  select  the  delay  that  gives  the  first  local  minimum  of  the  aver- 
age mutual  information  function  to  reconstruct  the  trajectory.  According  to  Shannon's 
information  theory,  the  average  mutual  information  of  two  random  variables,  Q and  S,  is 
computed  by 

I{Q-S)  =H{Q)-H{Q\S)  eq.2.2.1-1 

with  H (Q)  = , and 

q,s 

where  p is  the  probability  of  Q = q,  />  is  the  joint  probability  of  Q = q and  S = s,  and 
/?^i  ^ is  the  conditional  probability  of  Q = q given  S = s.  H(Q)  is  called  the  entropy  of  the 

random  variable  Q.  It  represents  the  randomness  contained  in  the  variable  Q.  The  relation 

in  eq.  2.2. 1-1  can  be  interpreted  as  follows: 

the  average  amount  of  uncertainty  in  Q resolved  by  the  observation  of  the  random 
variable  S 

= the  average  amount  of  uncertainty  in  Q - 
the  average  remaining  uncertainty  in  Q after  the  observation  of  S 


27 


The  mutual  information  measures  the  general  dependence  of  two  variables;  there- 
fore, it  provides  a better  criterion  for  the  choice  of  x than  the  autocorrelation  function, 
which  only  measures  linear  dependence.  The  estimation  of  conditional  probabilities  from 
experimental  data  is  usually  very  difficult.  We  can  derive  another  expression  for  eq.  2.2.1- 
1 based  on  the  following  equality 


H(^S) 


q,s 


eq.  2.2. 1-2 


q,s 


Now,  we  can  rewrite  eq.  2.2. 1-1  as 

I{Q,S)  = H{Q)  +7/(5) -H{Q,S)  eq.  2.2.1-3 

= I{S-Q) 

The  second  equality  in  eq.  2.2. 1-3  shows  that  the  average  amount  of  the  information  that 
we  obtain  for  one  of  the  variables  by  observing  the  other  remain  the  same.  According  to 
eq.  2.2. 1-3,  we  only  need  to  estimate  the  probability  of  Q,  the  probability  of  S,  and  the 
joint  probability  of  Q and  S.  Since  we  want  to  estimate  the  mutual  information  function  of 

x(t)  and  x(t+x),  let  us  assign  x(t)  as  the  random  variable  Q and  x(t+x)  as  the  random  vari- 
able S. 

To  estimate  the  probabilities,  we  construct  a phase  portrait  from  x(t)  and  x(t+x). 
Then,  we  partition  the  x(t)-x(t+x)  plane  with  equal-distance  grids  into  small  cells.  The 
dynamical  range  of  x(t)  is  equally  divided  into  small  bins.  The  histogram  of  x(t)  is  the  esti- 
mate of  the  probability  p . Because  x(t)  and  x(t+x)  actually  represent  the  same  time  series, 

they  will  have  the  same  distribution.  Therefore,  we  have  H{Q)  = H (5)  . The  main  diffi- 
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culty  in  computing  the  mutual  information  function  is  in  estimating  p . Assume  there 

are  totally  N points  on  the  phase  portrait.  If  a cell  in  the  x(t)-x(t+x)  plane  of  size  has 
N points  in  it,  we  estimate  p to  be  (N  / N) . For  a given  number  of  data,  larger 

sq  ^ ^ ^ sq  ' 

cells  have  more  points,  and  hence  the  estimate  of  the  probability  is  more  accurate,  but  the 
estimates  of  p may  become  too  flat,  which  will  result  in  underestimating  I(Q;S).  On  the 

q^  s 

other  hand,  smaller  cells  let  one  follow  changes  in  p over  short  distances,  but  allow  the 

q^  s 

fluctuations  due  to  small  sample  size,  which  will  lead  us  to  overestimate  I(Q;S).  Fraser 
and  Swinney  proposed  a recursive  method  to  estimate  the  average  mutual  information,  in 
which  the  size  of  each  cell  is  tailored  to  the  local  situation  [Fra86],  However,  according  to 
my  experience,  the  estimation  of  the  mutual  information  function  is  very  reliable  with  a 
large  range  of  the  cell  size  selection.  Lo  also  reported  the  same  conclusion  [Lo90].  There- 
fore, in  this  work  the  size  of  the  cells  is  fixed  in  estimating  the  mutual  information  func- 
tion. 

In  Figure  2-5, 1 show  a trajectory  of  the  Lorenz  attractor  (a=16,  r=45.92,  b=4.0). 
The  X coordinates  of  the  trajectory,  also  shown  in  Figure  2-5,  is  used  as  the  signal  in  the 
reconstruction.  The  estimates  of  the  mutual  information  function  based  on  different  parti- 
tion sizes  are  given  in  Figure  2-6.  We  note  that  the  locations  of  the  first  local  minimum  in 

these  curves  are  very  consistent  (at  x=5).  The  autocorrelation  function  of  the  signal  is  also 
computed  and  shown  in  Figure  2-7.  We  may  choose  x equal  to  20  from  the  autocorrelation 
curve.  In  Figure  2-8, 1 show  the  trajectories  reconstructed  from  the  signals  with  x=l,  x=5, 
and  x=20.  The  trajectories  are  shown  from  three  different  perspectives  to  provide  better 
visualization.  We  note  that  the  trajectory  reconstructed  with  x=5  resembles  the  original 
trajectory  best.  For  x=l,  some  regions  of  the  trajectory  become  indistinguishable.  And,  for 
x=20,  the  trajectory  is  totally  distorted. 
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These  results  show  that  the  estimate  of  the  mutual  information  function  can  really 
yield  a proper  selection  of  t in  the  delay  coordinate  reconstruction. 


Figure  2-5.  (a)  Lorenz  attractor  and  (b)  time  series  of  its  x-coordinates. 


bits/  sec 


Figure  2-6.  Average  mutual  information 
functions. 


Figure  2-7.  Autocorrelation  function 
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Figure  2-8.  Reconstructed  trajectories  with  (a)(b)(c)  x=l,  (d)(e)(f)  x-5, 
and  (g)(h)(i)  x=20. 


2 2.2  Proposed  Reconstruction  Procedure 

Because  the  computation  of  the  high-order  derivatives  of  a time  series  is  very  sen- 
sitive  to  noise  and  the  sampling  rate  of  the  signal,  the  vector  J2  (0  is  used  only  in  the 
reconstructions  of  low-dimensional  trajectories  [Eis91],  However,  the  advantage  of  using 
derivatives  as  the  components  of  the  construction  vectors  is  that  these  components 
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become  physical  quantities.  And,  we  can  avoid  the  problem  of  selecting  x.  In  this  research, 
I proposed  another  construction  vector,  whose  elements  are  the  Poisson  moments  of  the 

Jk. 

signal  [Pri93c],  This  construction  vector  can  be  related  to  the  vector  ^2  (0  » but  its  ele- 
ments can  be  easily  obtained  by  sampling  a multi-output  filter,  called  the  Poisson  Filter 
Chain  (PFC)  [Sha79], 

The  k-th  order  Poisson  Moment  Function  (PMF)  of  a signal  at  time  /q  is  computed 

by  convolving  the  signal  with  the  Poisson  distribution  function  (/)  [Zem87],  or 

^0 

A:  = 0,  1,2,3....  eq.2.2.2-1 

0 

Pj^(0  = 

where  A,  is  a constant.  This  computation  can  be  implemented  by  sampling  the  output  of  a 
filter,  whose  impulse  response  is  P^(0  , with  x(t)  as  the  input  signal.  The  transfer  func- 
tion of  this  filter  is 


eq.  111-1 


This  transfer  function  suggests  a cascaded  structure,  which  is  called  Poisson  Filter  Chain. 
A block  diagram  of  this  filter  chain  is  shown  in  Figure  2-9.  The  i-th  order  Poisson  moment 
of  the  signal  x(t)  can  be  directly  sampled  at  the  i-th  tap  of  this  filter  chain.  We  note  that 
this  filter  has  single  pole  located  at  -X,.. 

The  construction  vector  that  I proposed  is  as  follows: 


x*Pq(0  x*Pj  (0  ... 


eq.  2. 2. 2-3 


where  * denotes  the  convolution  operation.  The  i-th  element  of  this  construction  vector  is 
the  i-th  order  Poisson  moment  of  the  signal.  Therefore,  we  can  obtain  this  construction 
vector  by  sampling  the  output  of  each  tap  of  a (2m+l)-th  order  PFC.  An  immediate  advan- 
tage of  this  implementation  is  that  we  can  combine  the  data  acquisition  and  the  trajectory 
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Figure  2-9.  Poisson  Filter  Chain 


reconstruction  in  one  process.  This  proposed  reconstruction  vector  can  be  related  to  the 
construction  vector  J2  (0  by  applying  an  important  property  of  the  Poisson  moment  func- 
tion, which  states  that  the  PMF  of  the  time  derivatives  of  a signal  x(t)  can  be  expressed  in 


terms  of  the  PMFs  of  the  signal,  or 

tn 


- J 


0 


d' 


dt' 


:x(t) 


Pl^(tQ-t)dt 


eq.  2. 2. 2-4 


Based  upon  this  property,  the  following  relationship  can  be  established  between  >'2  (0 


and 


eq.  2. 2. 2-5 


where  A 
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Matrix  A represents  a rotation  operation,  and  F[  ] is  a filtering  process  by  a (2m+l)-th 
order  PFC.  In  other  words,  after  a rotation  the  proposed  construction  vector  is  a filtered 
version  of  the  vector  ^2  (0  • Since  this  filtering  process  is  equivalent  to  attaching  a linear 
system  of  a single  pole  with  multiplicity  to  the  underlying  system,  we  can  expect  that  the 
number  of  the  Lyapunov  exponents  will  increase.  However,  according  to  the  Kaplan- 
Yorke  formula  of  the  Lyapunov  dimension  in  eq.  2. 1.2-3,  the  dimension  of  the  attractor 
will  not  change  when  the  pole  of  the  filter  is  smaller  than  j [Bad88].  Therefore,  with  a 
proper  selection  of  A,  a state-space  reconstruction  using  vector  (f)  will  not  change  the 
measurements  of  the  original  dynamical  invariants. 

The  pole  of  a PFC  is  equal  to  -X.  To  preserve  the  dynamical  invariants,  the  condi- 
tion -A,  < X.  , must  be  satisfied.  Therefore,  -A<.^  j is  the  lower  bound  of  A..  However,  to 

J J 

decide  A,.^  j , we  need  to  estimate  the  whole  spectrum  of  the  Lyapunov  exponents.  This 

> 

estimation  is  very  difficult  for  experimental  data.  Therefore,  I usually  select  the  value  of  A, 
according  to  the  power  spectrum  of  the  signal  x(t).  With  a large  A,,  the  bandwidth  of  a PFC 
becomes  broader.  The  selection  of  X should  yields  a bandwidth  that  covers  the  most  dom- 
inant frequency  components  of  the  signal.  However,  while  we  choose  a large  A,  to  obtain  a 
required  bandwidth  we  may  find  that  the  signal  will  be  attenuated  severely.  To  compensate 
this  attenuation,  we  can  put  a gain,  proportional  to  the  value  of  A,,  in  each  building  block 
of  a PFC.  With  this  modification,  the  transfer  function  of  a k-th  order  PFC  becomes 


r ^ 1 


eq.  2.2. 2-6 


This  modification  will  not  change  the  relationship  between  >'2  (/)  and  >>3  (/)  given  in  eq. 
2. 2. 2-5,  except  that  the  matrix  A needs  to  be  modified.  But  the  new  matrix  is  stilt  nonsin- 
gular. Therefore,  we  can  preserve  the  measurement  of  the  dynamical  invariants  in  a recon- 
struction using  this  modified  PFC. 

A discrete-time  version  of  the  PFC  was  proposed  in  the  study  of  ANN  memory 
structures  and  was  named  gamma  filter  [deV92].  The  structure  of  the  gamma  filter  is 
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shown  in  Figure  2-10.  Comparing  the  transfer  functions  of  the  building  blocks  in  the 
gamma  filter  and  the  PFC,  we  note  that  the  gamma  filter  is  a first-order  difference  approx- 
imation of  the  PFC.  Therefore,  the  gamma  filter  can  be  used  to  compute  the  PMFs  of  a 
signal  in  digital  simulation.  Besides  the  computation  of  the  PMFs,  there  are  several  impor- 
tant applications  of  the  gamma  filter  in  this  research.  I will  discuss  the  properties  of  this 
filter  in  Chapter  3. 


Figure  2-10.  Gamma  filter 


To  verify  that  this  construction  method  can  preserve  the  dynamical  invariants,  the 
following  experiments  were  conducted.  I integrated  the  Lorenz  equation  (o=16,  r=45.92, 
b=4.0)  using  a 4-th  order  Runge-Kutta  method  with  step  size  0.02.  The  x variable  of  the 
solution  was  chosen  as  the  test  signal.  The  signal  has  totally  3000  samples.  Two  trajecto- 
ries are  reconstructed  in  the  6-dimensional  Euclidean  space  by  using  the  delay  coordinate 
method  with  x=5  and  my  proposed  method.  A gamma  memory  is  used  in  the  simulation  to 
compute  the  PMFs  of  the  signal.  The  parameter  p in  the  gamma  filter  has  the  same  rote  as 
the  X,  in  the  PFC.  The  value  of  the  p is  set  to  0.7;  in  other  words,  the  pole  of  the  gamma  fil- 
ter is  located  at  0.3  in  the  z-plane.  With  this  selection,  most  signal  power  will  be  included 
in  the  passband  of  the  gamma  filter. 
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The  largest  projection  of  both  trajectories  in  a 2-dimensional  plane  are  shown  in 
Figure  2-11.  They  look  very  similar.  To  quantify  this  similarity,  we  can  measure  the 
dimensionality  of  both  trajectories.  Since  we  have  not  discussed  the  algorithm  for  this 
measurement,  I will  only  show  the  results  without  explaining  the  algorithm.  In  the  next 
section,  this  algorithm  will  be  introduced  in  detail. 

I used  the  Grassberger-Procaccia  algorithm  to  estimate  the  correlation  dimensions 
of  both  trajectories  [Gra83b].  The  results  in  Figure  2-12  show  that  the  estimate  of  the 
dimension  is  2.04  for  the  trajectory  from  the  delay  coordinate  method  and  2.04  for  the  tra- 
jectory from  my  proposed  method.  They  are  the  same.  These  results  show  that  the  pro- 
posed reconstruction  procedure  can  really  preserve  the  measurements  of  the  dynamical 
invariants. 

To  verify  if  we  can  still  compute  the  correlation  dimension  when  the  signal  is 
noisy,  I added  a gaussian  white  noise  to  the  test  signal  with  signal-to-noise  ratio  equal  to 
30  dB.  The  trajectories  reconstructed  from  this  noisy  signal  are  given  in  Figure  2-13.  We 
note  that  the  trajectory  reconstructed  by  using  the  delay  method  is  more  jagged.  And,  the 
results  in  Figure  2-14  show  that  the  GP  algorithm  fails  in  the  dimension  estimation  of  the 
trajectory  from  the  delay  method,  while  the  dimension  can  still  be  computed  for  the  trajec- 
tory from  my  proposed  method.  Thus,  we  may  conclude  that  the  proposed  method  is  more 
robust  than  the  delay  coordinate  method  in  the  reconstruction  of  underlying  dynamics 
from  noisy  signals.  However,  due  to  the  noise  effect,  this  dimension  estimate  (2.18)  is  a 
little  higher  than  the  previous  estimation. 
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Figure  2-11.  2-D  projections  of  the  trajectories  reconstructed  by  using  (a) 
delay  coordinate  method,  and  (b)  proposed  method. 


Figure  2-12.  Correlation  dimension  estimates  (a)(c)  for  the  trajectory  from  the 
delay  method,  and  (b)(d)  for  the  trajectoiy  from  my  proposed  method. 
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Figure  2-13.  2-D  projections  of  the  trajectories  reconstructed  by  using 
(a)  delay  coordinate  method,  and  (b)  proposed  method,  (for  noisy  signal) 


Figure  2-14.  Correlation  dimension  estimates  (noisy  signal)  (a)(c)  for  the  trajectory 
from  the  delay  method,  and  (b)(d)  for  the  trajectory  from  my  proposed  method. 
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2 3 Estimation  of  Dynamical  Invariants 

2 3.1  Estimation  of  Correlation  Dimension 

To  compute  the  dimensionality  of  a reconstmcted  trajectory,  I use  an  algorithm 
proposed  by  Grassberger  and  Procaccia,  which  is  also  called  the  GP  algorithm  [Gra83b]. 
Given  a signal,  this  algorithm  can  yield  an  estimate  of  the  correlation  dimension  of  its 
reconstructed  trajectory.  The  GP  algorithm  has  been  widely  accepted  in  the  study  of 
experimental  chaos  due  to  its  moderate  requirements  of  computing  time  and  memory.  In 
this  section,  I will  review  this  algorithm  and  propose  a modification  to  improve  its  robust- 
ness. 


According  to  the  GP  algorithm,  we  compute  the  correlation  integral  first,  which  is 


given  by 


N N 


C(s) 


1 

N‘ 


y y tf(e-  y,-y^  ) 


eq.  2.3. 1-1 


/ = ly  = 1 


Note  that  we  lift  the  limit  from  eq.  2.9  because  the  number  of  the  data  samples,  N,  are 
always  finite  in  an  experimental  setting,  and  now  represents  a reconstructed  state  vec- 
tor. The  value  of  the  correlation  integral  is  proportional  to  the  average  number  of  pairs  of 
the  points  within  a spheroid  with  radius  e.  Thus,  to  compute  the  correlation  integral  from  a 

reconstructed  trajectory,  we  just  count  the  number  of  points  located  inside  the  s-neighbor- 
hood  of  each  state  vector,  and  average  the  count.  Once  we  estimate  the  correlation  inte- 
grals for  different  e’s,  the  correlation  dimension  is  the  slope  of  the  curve  of  InC(e)  v.s. 
ln(e),  which  is  called  Correlation  Integral  Map  (or  CIM  for  short).  The  idea  behind  the  GP 
algorithm  is  that  when  we  reconstruct  the  trajectory  in  a space  of  dimension  higher  than  a 
certain  value,  the  slope  of  the  CIM  curves  will  saturate  and  become  a constant  in  a scaling 
region  of  8.  This  slope  is  the  estimate  of  the  correlation  dimension.  In  Figure  2-12,  I 
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already  showed  the  CIM  curves  and  their  slope  estimates  for  the  Lorenz  signal.  The 
dimension  of  the  reconstruction  space  is  increased  from  2 to  14  with  an  increment  of  2. 
We  note  that  the  saturation  slope  is  2.04,  which  is  the  estimate  of  the  correlation  dimen- 
sion of  the  Lorenz  attractor.  This  estimate  is  very  close  to  2.07  reported  in  the  literature 
[Wol85]. 

Using  a finite  number  of  data  samples  in  the  reconstruction,  we  may  sometimes 
observe  a knee  on  the  CIM  curves  in  the  region  of  intermediate  e.  The  occurrence  of  the 
knee  make  the  estimation  of  the  correlation  dimension  extremely  difficult  because  the 
slope  of  the  curves  may  never  saturate  in  this  case.  There  are  two  sources  that  may  cause 
this  knee  phenomenon[Eck85][The86][Kuo91].  The  first  one  is  the  noise  that  contami- 
nated the  signal.  Because  random  noise  has  infinite  number  of  degrees  of  freedom,  the 
dimension  of  random  noise  is  said  to  be  infinite.  The  attractor  reconstructed  from  a noisy 
signal  will  become  jagged.  To  reflect  the  dynamics  of  the  noise,  this  jaggedness  will  result 

in  an  overestimated  correlation  integral  in  the  region  of  small  8.  When  we  increase  8 over 
a certain  value,  the  influence  of  the  noise  on  the  computation  of  the  correlation  integral 
will  diminish.  Consequently,  the  increase  rate  of  the  correlation  integral  will  drop,  and  a 
knee  is  thus  created. 

This  explanation  seems  quite  plausible,  and  in  some  cases  the  knee  can  actually  be 
flatten  if  we  filter  the  construction  signal  first.  However,  the  knee  can  still  be  observed 
sometimes  even  though  the  signal  just  contains  a small  quantization  error.  This  is  illus- 
trated in  Figure  2-15.  Again  we  use  the  example  of  the  Lorenz  signal  but  with  a different 
set  of  coefficients:  a=10,  r=28,  and  b=8/3.  The  signal  is  sampled  at  200  Hz,  and  there  are 
totally  7000  samples  in  the  computation.  The  amplitude  of  the  signal  is  multiplied  by  a 
factor  of  100  and  quantized  as  integers  to  speed  the  computation  of  the  correlation  inte- 
gral. The  dimension  of  the  reconstruction  space  is  6 and  the  delay  coordinate  method  is 
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used  with  x=35.  We  note  that  there  is  a rapid  increase  in  the  slope  of  the  CIM  curve  in  the 
region  ln(e)  = [3,4],  This  signal  contains  only  small  quantization  errors,  and  the  knee 
occurs  in  a region  of  e much  higher  than  the  error  level.  Thus,  a second  hypothesis  is  pro- 
posed. 


Figure  2-15.  Knee  in  a CIM  curve  for  the  Lorenz  signal  of  7000  points 

The  reconstructed  trajectory  can  be  regarded  as  a sampled  version  of  a continuous 
trajectory.  It  is  sampled  uniformly  in  time  but  not  in  space.  In  other  words,  the  parts  of  the 
trajectory  reconstructed  from  the  data  segments  having  slowly-changing  waveforms  may 
be  over-sampled,  while  the  others  corresponding  to  the  data  segments  having  fast-chang- 
ing waveforms  are  poorly  represented.  This  may  introduce  some  artificial  correlation  into 
the  computation  of  the  correlation  integral.  If  a long  data  segment  is  available  for  the 
reconstruction,  the  effect  of  the  artificial  correlation  can  be  ignored  because  the  recon- 
structed trajectory  will  visit  every  portion  of  the  attractor  repeatedly.  However,  a long 
recording  is  sometimes  not  obtainable  in  an  experimental  setting.  In  addition,  since  the 

time  complexity  of  the  GP  algorithm  is  S(N^)  the  computation  time  for  the  correlation 
integral  will  increase  dramatically  with  the  length  of  the  data  segment. 
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To  reduce  the  effect  of  the  artificial  correlation,  Theiler  suggested  that  in  the  com- 
putation of  the  number  of  points  within  a spheroid  of  a reference  point  we  should  skip  a 
certain  number  of  neighbors  that  are  close  in  time.  Thus,  we  can  rewrite  eq.  2.3. 1-1  as 


C(E) 


N N-J 


j = H-/  = 1 


eq.  2.3. 1-2 


where  w is  the  number  of  points  that  we  skip.  Theiler  also  recommended  a lower  bound 
for  w according  to  the  autocorrelation  time  of  the  signal,  the  total  number  of  data  samples, 
and  the  dimension  of  the  reconstruction  space. 

According  to  Theiler's  modification,  close-in-time  neighbors  are  always  not  taken 
into  account  in  the  computation  of  the  correlation  integral.  The  artificial  correlation  for 
those  oversampled  regions  of  a reconstructed  trajectory  can  thus  be  reduced.  However, 
this  procedure  may  also  distort  the  estimation  of  correlation  integrals  in  the  undersampled 
regions  of  the  trajectory.  These  portions  of  the  trajectory  are  already  poorly  represented. 
To  skip  the  neighboring  points  in  these  regions  when  computing  the  correlation  integral 
may  result  in  an  underestimate  of  dimensions. 

To  reduce  the  effect  of  the  artificial  correlation,  I propose  two  thresholds  in  deter- 
mining if  a neighboring  point  should  be  removed  from  the  computation  of  the  correlation 
integral.  First,  we  compute  the  average  angle,  between  two  vectors,  one  (Vj)  formed 

by  the  current  point  (on  the  reconstructed  trajectory)  and  the  first  next  point,  and  the  other 
(V2)  formed  by  the  current  point  and  the  second  next  point.  This  average  angle  will  be 

used  as  a threshold  in  determining  if  the  first  neighboring  point  is  located  at  a region  of  the 
trajectory  with  folding  corresponding  to  the  abrupt  changes  in  amplitude  in  a data  seg- 
ment. This  region  is  usually  the  location  where  a reconstructed  trajectory  is  poorly  repre- 
sented. Second,  we  calculate  the  correlation  integral  using  the  original  data  set.  The  radius 
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where  the  knee  occurs  will  be  used  as  the  threshold  of  distance,  . Then  we  go 

through  the  reconstructed  trajectory  and  remove  the  first  neighbor  point  permanently  if  the 
following  conditions  are  satisfied 

eq.  2.3. 1-3 

and  ^V|  ^knee 

where  (p  is  the  angle  between  v,  and  v.,,  and  5 is  the  length  of  v, . The  angular 

VjV2  1^  vj  i 

threshold  will  preserve  the  points  in  the  undersampled  regions,  and  the  distance  threshold 
will  remove  the  artificial  correlation  in  the  oversampled  regions. 

This  procedure  differs  from  the  Theiler's  method  in  three  aspects.  First,  an  angular 
threshold  is  added  to  preserve  the  number  of  the  points  in  those  poorly  represented 
regions.  Second,  instead  of  using  the  autocorrelation  function  of  the  signal,  the  distance 
threshold  is  directly  determined  from  the  radius  where  the  knee  occurs.  Third,  in  Theiler's 
method  the  nearest  neighboring  points  are  not  actually  removed;  we  just  do  not  count 
them  in  computing  the  number  of  points  in  the  neighborhood  of  a reference  point.  Still, 
every  point  on  the  reconstructed  trajectory  will  be  used  as  the  reference  point  in  the  com- 
putation of  the  correlation  integral.  On  the  contrary,  in  my  proposed  method  the  neighbor- 
ing points  that  meet  the  criteria  are  removed  from  the  reconstructed  trajectory 
permanently.  Thus,  the  total  number  of  points  used  in  the  computation  of  the  correlation 
dimension  can  be  actually  reduced. 

My  proposed  method  has  been  applied  to  both  model  signals  and  experimental  sig- 
nals and  been  shown  to  be  superior  to  Theiler's  method  [Kuo91].  In  Figure  2-16, 1 com- 
pare several  methods  to  reduce  the  knee  effect  in  estimating  the  correlation  dimension  of 
the  Lorenz  attractor.  In  Figure  2- 16(a),  I show  the  result  of  using  more  data  samples  (9000 
points).  The  improvement  is  very  limited.  For  the  result  of  the  Theiler’s  method,  I select  w 
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equal  to  1.  In  Figure  2- 16(b),  the  amplitude  of  the  knee  is  reduced,  but  the  region  of  the 
knee  is  extended.  The  result  of  my  proposed  method  is  shown  in  Figure  2-16(c).  The  dis- 
tance threshold  is  set  to  3.8  (in  natural  logarithm),  the  angular  threshold  is  2.15°.  The 
result  clearly  shows  that  the  knee  effect  is  reduced.  This  can  also  be  observed  in  the  slope 
estimation  of  the  CIM  curves,  as  shown  in  Figure  2- 16(d). 


Figure  2-16.  CIM  curves  for  (a)  9000-point  data  segment,  (b)7000-point  data  seg- 
ment with  Theiler  correction,  (c)  7000-point  data  segment  with  my  proposed 
method,  and  (d)  the  slope  estimation  of  the  CIM  curves.  (1)  7000-point  data  seg- 
ment, (2)  9000-point  data  segment,  (3)  7000-point  data  segment  with  Theiler  cor- 
rection, and  (4)  7000-point  data  segment  with  my  proposed  method. 
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2.3.2  Estimation  of  Maximal  Lvapunov  Exponent 

For  non-chaotic  dynamics,  there  exist  only  non-positive  Lyapunov  exponents.  The 
number  of  the  exponents  whose  values  equal  to  zero  represents  the  number  of  the  funda- 
mental frequencies  contained  in  the  dynamics.  In  general,  the  power  spectrum  of  the  sys- 
tem outputs  can  reveal  this  information.  On  the  other  hand,  a chaotic  system  is 
characterized  by  its  positive  Lyapunov  exponents.  The  positive  Lyapunov  exponents  can 
not  be  identified  from  the  power  spectrum. 

In  the  section,  I will  review  an  algorithm,  proposed  by  Wolf  et  al.,  to  compute  pos- 
itive Lyapunov  exponents  from  a time  series  [Wol85].  This  algorithm  is  straightforward 
and  robust  especially  for  the  estimation  of  the  maximal  exponent.  In  this  research,  the 
measurement  of  the  largest  positive  Lyapunov  exponent  is  extremely  important.  For  a sta- 
ble LTI  system,  the  system  flow  will  approach  the  attractor  along  the  direction  of  the 
eigenvector  corresponding  to  the  largest  eigenvalue  (except  those  trajectories  whose  initial 
conditions  are  located  on  the  extension  of  the  other  eigenvectors).  Similarly,  in  a chaotic 
dynamic  the  stretch  in  the  direction  of  the  maximal  divergence  will  eventually  dominate 
the  evolution  of  a volume  in  the  system  flow.  In  other  words,  the  divergence  of  two  nearby 
trajectories  is  mainly  determined  by  the  maximal  Lyapunov  exponent.  In  dynamic  model- 
ing, the  maximal  Lyapunov  exponent  is  an  important  indicator  about  how  far  the  model 
can  be  used  in  predicting  the  original  chaotic  signal  [Cas89]. 

In  this  work,  I will  measure  only  the  largest  Lyapunov  exponent  to  decide  the 
design  parameters  of  a dynamic  model  and  validate  the  model.  This  is  not  only  because 
the  largest  Lyapnunov  exponent  is  the  measurement  of  the  dominant  divergence  in  the  sys- 
tem flow,  but  also  because  there  is  still  no  algorithm  that  can  reliably  estimate  all  of  the 
Lyapunov  exponents  from  a time  series.  Most  of  the  algorithms  available  today  yield  only 
the  estimates  of  positive  exponents  [San85][Eckm86][Wol85].  Some  of  these  algorithms 
require  the  establishment  of  the  local  linear  models  for  a reconstructed  trajectory 
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[San85][Eckm86],  The  models  are  then  used  to  compute  the  Lyapunov  exponents.  Thus, 
the  success  of  these  algorithms  relies  heavily  on  how  well  these  local  models  can  repre- 
sent the  reconstructed  dynamics.  On  the  other  hand,  the  algorithm  proposed  by  Wolf  et  al. 
is  more  straightforward.  It  estimates  the  Lyapunov  exponents  directly  from  the  recon- 
structed trajectory. 

According  to  this  algorithm,  we  use  the  reconstructed  trajectory  as  a “fiducial”  tra- 
jectory. The  maximal  positive  Lyapunov  exponent  is  the  measurement  of  the  divergent 
rate  of  two  nearby  trajectories.  Thus,  what  we  try  to  measure  is  the  divergence  of  a nearby 
trajectory  from  the  fiducial  trajectory.  Suppose  we  start  from  a point  on  the  fiducial  trajec- 
tory. Since  we  actually  have  only  one  reconstructed  trajectory  from  a given  signal,  we  will 
search  for  a neighboring  point  on  another  segment  of  the  trajectory  and  treat  the  point  as 
the  initial  point  of  a nearby  trajectory.  The  divergent  rate  between  the  fiducial  trajectory 
and  this  segment  of  the  trajectory  can  be  computed  by 


eq.  2. 3. 2-1 


where  d (L)  is  the  initial  distance  between  those  two  neighboring  points,  and  d (/)  is  the 

separation  between  both  segments  after  t- time  steps.  As  we  mentioned  above,  the 

direction  of  the  divergence  between  two  nearby  trajectories  will  eventually  follow  the 
direction  of  the  maximal  principal  axis  of  an  element  volume  in  the  system  flow.  Thus,  the 
computation  in  eq.  2. 3. 2-1  can  be  regarded  as  an  estimate  of  the  maximal  Lyapunov  expo- 
nent. 

However,  since  the  largest  separation  of  two  nearby  trajectories  will  be  eventually 
limited  by  the  radius  of  the  attractor,  the  estimate  in  eq.  2. 3. 2-1  will  be  equal  to  zero  if  we 
let  these  two  trajectories  evolve  long  enough.  Therefore,  after  a certain  evolution  time,  say 
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EVOLV  seconds,  we  need  to  examine  if  the  separation  of  the  fiducial  trajectory  and  its 
“nearby  trajectory”,  d , is  larger  than  a certain  level  which  will  be  denoted  by  SCALMX. 


d > SCALMX,  we  need  to  find  another  neighboring  point  and  restart  computing  the 
separation  rate;  otherwise  we  keep  using  the  same  segment  as  a nearby  trajectory  in  the 
computation.  This  replacement  is  illustrated  in  Figure  2-17.  The  estimate  of  the  maximal 

Lyapunov  exponent  then  becomes  the  average  of  the  Vs  in  eq.  2. 3. 2-1  and  is  computed  by 
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eq.  2. 3. 2-2 


where  M is  the  total  number  of  replacements,  and  (j,  = k- EVOLV  seconds.  A rule  of 

thumb  suggested  by  Wolf  et  al.  is  selecting  the  evolution  time,  EVOLV,  equal  to  0.5  to  1.5 
times  of  the  system  orbit  time.  However,  the  mean  orbit  time  for  a chaotic  dynamics  is  not 
clearly  defined.  Principe  and  Lo  proposed  to  use  the  median  frequency  estimated  from  the 
power  spectrum  as  the  mean  orbit  time[Pri91]. 

In  a replacement,  we  need  to  preserve  the  orientation  in  the  measurement.  This  is 
because  we  want  to  measure  the  long-term  behavior  of  a single  principal  axis.  Therefore, 
during  each  replacement  several  candidate  neighboring  points  are  picked  up  and  the  one 
that  satisfies  the  following  two  criteria  best  will  used  as  the  next  initial  neighboring  point. 
These  two  criteria  are  stated  as  follows: 

1) .  the  separation  of  this  new  initial  point  from  the  evolved  fiducial  point,  <^(t^_  j ) , 

is  small,  and 

2) .  the  angular  separation  between  the  evolved  and  the  replacement  elements,  0^,  is 

small. 

If  the  signal  used  in  the  reconstruction  is  noisy,  the  selection  of  the  initial  separa- 
tion must  be  larger  than  the  noise  level.  Principe  and  Lo  suggested  that  this  threshold  can 
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be  decided  by  observing  the  location  of  the  knee  on  the  CIM  curves  (see  the  previous  sec- 
tion). If  the  knee  is  caused  by  the  noise,  the  radius  e where  this  knee  occurs  can  be  used  as 
the  threshold,  which  is  denoted  as  SCALMN.  They  also  suggested  that  the  value  of 
SCALMX  can  be  derived  from  the  upper  bound  of  the  scaling  region  in  the  CIM  curves, 
which  is  related  to  the  diameter  of  the  attractor  [Pri91],  In  this  work,  I will  choose  these 
parameters  (EVOLV,  SCALMN,  SCALMX)  according  to  the  suggestion  given  by  Princ- 
ipe and  Lo. 


I 


Figure  2-17.  Pictorial  representation  of  the  algorithm  proposed  by  Wolf  et  al. 

This  algorithm  can  be  extended  to  the  measurement  of  any  positive  Lyapunov 
exponents.  For  instance,  we  can  compute  the  expansion  rate  of  an  area  formed  by  a point 
on  the  fiducial  trajectory  and  another  two  points  on  two  separate  nearby  trajectories.  This 
divergent  rate  is  equal  to  the  sum  of  the  two  largest  Lyapunov  exponents,  X,  + A,-.  There- 
fore, we  can  compute  A,,  A,,  A,, ...  by  measuring  the  expansion  rate  of  the  distance,  the 
area,  the  volume,...,  and  so  on. 

As  an  example,  I compute  the  maximal  Lyapunov  exponent  for  the  Lorenz  signal 
(a=16,  r=45.92,  b=4)  sampled  at  50  Hz.  A trajectory  is  reconstructed  from  the  signal  in  R° 
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using  the  delay  method  with  x = 3.  The  signal  has  a 1/f  spectrum  as  shown  in  Figure  2-18. 
The  median  frequency  is  2.5  Hz.  Therefore,  I choose  EVOLV  equal  to  0.4  seconds.  The 
SCALMX  is  selected  to  be  30  % of  the  maximal  8 in  the  CIM  curves,  which  is  the  upper 
bound  of  the  scaling  region  as  shown  in  Figure  2- 12(c).  Since  the  signal  only  contains 
small  quantization  errors,  1 set  the  SCALMN  equal  to  1%  of  the  maximal  s.  The  curve  of 
the  run-time  estimation  of  the  exponent  is  shown  in  Figure  2-19.  The  estimate  of  the  max- 
imal Lyapunov  exponent  is  2.20  ± 0.03  bits/sec,  which  is  computed  based  upon  the  aver- 
age of  the  last  20  run-time  estimates.  This  estimate  is  close  to  the  value  of  2.16  bits/sec 
reported  by  Wolf  et  al. 


Figure  2-18.  Power  spectrum  of  the  Lorenz  signal. 
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Figure  2-19.  Run-time  estimates  of  the  maximal  Lyapunov  exponent. 

2.4  Noise  Reduction  and  Dynamic  Modeling  from  the  Reconstruction 

The  reconstruction  procedures  not  only  enable  us  to  quantify  the  dynamical  prop- 
erties of  the  underlying  system,  but  also  provide  two  important  applications.  One  is  the 
state-space  noise  reduction,  and  the  other  is  the  dynamic  modeling.  The  two  topics  will  be 
discussed  in  this  section. 

2 4.1  Noise  Reduction  in  State  Space 

Noise  reduction  is  an  important  engineering  problem.  In  this  study,  I assume  that 
noise-free  signals  are  generated  by  some  nonlinear  deterministic  systems.  Therefore,  it  is 
not  appropriate  to  apply  any  linear  optimal  filter  (e.g.  Kalman  filter),  which  is  based  upon 
statistics  of  the  signal  and  the  noise,  to  reduce  noise.  In  addition,  filtering  a signal  with  a 
linear  filter  is  equivalent  to  attaching  a linear  system  to  the  underlying  dynamic.  We 
should  avoid  this  artifact  in  dynamic  modeling.  In  this  section,  I will  review  two  methods 
that  have  been  proposed  to  reduce  noise  in  state  space. 

In  general,  the  system  flow  of  a deterministic  system  is  diffeomorphic  [Par87]. 
This  means  that  a reconstructed  trajectory  from  the  output  of  a deterministic  system  is 
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usually  smooth.  However,  since  signals  are  always  contaminated  by  noise,  reconstructed 
trajectories  are  usually  jagged.  Kostelich  and  Yorke  proposed  to  establish  local  linear 
models  of  a reconstructed  trajectory  to  smooth  the  trajectory  and  consequently  reduce  the 
noise  [Kos90].  The  linear  approximation  that  they  used  to  model  local  behaviors  was  first 
introduced  by  Eckmann  and  Ruelle  [Eck85].  Therefore,  this  approximation  is  also  called 
Eckmann-Ruelle  linearization.  According  to  this  procedure,  we  approximate  the  system 

dynamics  in  a small  neighborhood  of  a point  x as 

f{x)»Jx  + b eq.  2.4. 1-1 

A 

where  J is  the  Jacobian  of / at  x,  and  b is  constant  vector.  Actually,  this  is  a linear  approx- 
imation of  Taylor’s  expansion  of  the  vector  field  /.  Since  /is  usually  unknown,  Kostelich 
and  Yorke  used  the  least  square  to  estimate  the  entries  of  the  matrix  J and  the  elements  of 
the  vector  b for  each  small  region  of  a trajectory.  After  applying  these  linear  local  models 
to  the  reconstructed  trajectory,  we  can  construct  a much  smoother  trajectory.  Conse- 
quently, the  noise  can  be  reduced  to  some  extend.  However,  a long  segment  of  the  signal  is 
usually  required  in  this  procedure  to  obtain  reliable  estimates  of  local  Jacobians.  Kostelich 
and  Yorke  also  mentioned  some  difficulties  in  the  computation  of  the  local  linear  approxi- 
mation, e.g.  ill-conditioned  least  squares,  finding  nearest  neighbors,  and  errors  in  vari- 
ables. 

Another  approach  to  remove  noise  from  a reconstructed  trajectory  was  proposed 
by  Landa  and  Rosenblum  [Lan91].  The  main  idea  is  to  apply  the  difference  in  dimension- 
ality of  a deterministic  signal  and  a random  process  to  separate  noise  from  a signal.  When 
we  reconstruct  a trajectory  from  a random  process,  the  points  will  eventually  disperse  into 
every  dimension  of  the  construction  space  no  matter  how  high  the  dimension  of  the  space 
is.  Thus,  we  say  the  dimension  of  a random  process  is  infinite.  On  the  other  hand,  the  tra- 
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jectory  reconstructed  from  the  output  of  a deterministic  system  has  finite  dimension  and 
will  be  confined  to  a subspace.  This  subspace  is  called  the  embedding  space  of  the  trajec- 
tory. If  we  project  a trajectory  reconstructed  from  a signal,  which  is  contaminated  by  some 
additive  noise,  onto  the  embedding  space  of  the  signal  trajectory  without  noise,  we  should 
be  able  to  remove  high-dimensional  noise. 

This  projection  operator  can  be  regarded  as  a state-space  filter.  This  state-space  fil- 
tering is  superior  to  frequency-domain  filtering  if  both  signal  and  noise  have  broadband 
spectra.  Because  we  remove  noise  based  upon  the  dynamical  properties  of  the  signal  and 
the  noise  instead  of  their  spectra.  Thus,  we  will  not  significantly  distort  signal  components 
in  some  frequency  bands.  As  we  need  to  specify  the  cutoff  frequency  for  a frequency- 
domain  filter,  so  we  need  to  determine  the  embedding  subspace  in  a construction  space. 
Landa  and  Rosenblum  proposed  a method  to  determine  the  signal  subspace.  Therefore,  the 
determination  of  the  embedding  subspace  is  less  subjective  than  the  selection  of  the  cutoff 
frequencies  of  a frequency  filter.  In  the  following,  I will  discuss  this  method  and  give  two 
examples. 

According  to  Takens’  theorem,  the  dimension  of  a construction  space  should  be 
twice  larger  than  the  dimension  of  the  original  system  attractor,  i.e.  2m  > 2d.  However, 
without  any  a priori  knowledge  about  the  dimension  of  the  original  attractor,  we  usually 
choose  the  dimension  of  the  construction  space  much  higher  than  required.  Besides,  even 
though  we  know  the  dimension  of  the  attractor,  we  may  find  that  the  condition  derived  by 
Takens  is  an  overestimate  of  the  reconstruction  dimension  in  some  cases.  Therefore,  the 
embedding  space  of  a signal  trajectory  is  usually  a subspace  of  a construction  space.  Since 
the  dimension  of  a system  attractor  is  an  invariant,  a reconstructed  trajectory  will  also  be 
confined  to  the  embedding  subspace  when  the  dimension  of  a construction  space  is  higher 
than  required.  To  determine  the  embedding  subspace,  we  need  to  find  a set  of  basis  vectors 
and  determine  the  dimension  for  this  subspace. 
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For  convenience,  let  us  define  a trajectory  matrix  as  follows 


-r 

(0) 

-T 

y (1) 

-r 

y (2) 


eq.  2.4. 1-2 


where  y (0  is  a reconstructed  state  vector  at  time  t,  and  T is  the  transpose  operator.  We 
can  apply  the  principal  component  analysis  to  a trajectory  matrix  and  find  out  a set  of  its 
orthogonal  principal  axes.  This  procedure  is  also  called  singular  value  decomposition 

(SVD)  in  linear  algebra  [Leo90].  Let  us  denote  these  principal  axes  as  • • • • They  are 

ranked  according  to  the  descending  order  of  their  corresponding  principal  components. 
Among  all  subspaces  of  the  same  dimension,  the  subspace  formed  by  the  principal  axes 
corresponding  to  the  largest  principal  components  will  have  the  largest  projection  of  the 
trajectory.  If  a trajectory  is  reconstructed  from  a noise-free  signal  and  has  dimension  k,  a 


subspace  spanned  by  /?j,  will  embrace  all  of  the  trajectory.  Therefore,  to  determine 

the  embedding  dimension  of  a trajectory  we  can  increase  the  dimension  of  the  projection 
space  by  adding  one  principal  axes,  corresponding  to  the  next  largest  principal  component, 
at  a time  and  monitor  the  decrease  of  the  residual  error  power,  which  is  defined  as 


||t(0  -Tr(0 


eq.  2.4. 1-3 


A A 

where  (/)  is  the  projection  of  y (t)  onto  a r-dimensional  subspace,  T is  the  total  num- 
ber of  the  state  vectors,  and  ||  lU  represents  f,-2  norm.  When  the  residual  power  decreases 


to  zero,  the  projection  space  is  equal  to  the  embedding  subspace,  and  the  number  of  the 
principal  axes  used  as  the  bases  of  the  projection  space  is  the  dimension  of  the  embedding 
subspace. 
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On  the  other  hand,  if  the  signal  is  contaminated  by  an  additive  white  noise  the  res- 
idue power  can  not  be  reduced  to  zero  unless  the  projection  space  and  the  construction 
space  have  the  same  dimension.  However,  we  should  be  able  to  observe  a smaller,  con- 
stant drop  rate  of  the  residual  power  if  we  keep  increasing  the  dimension  of  the  projection 
space.  This  is  because  when  the  dimension  of  a projection  space  is  higher  than  the  dimen- 
sion of  the  embedding  subspace  this  drop  is  mainly  contributed  by  the  high-dimensional 
noise. 

In  Figure  2-20,  I show  two  test  signals;  one  is  a periodic  signal  generated  by  the 

van  der  Pol  equation(  x+(x  - l)x-i-x  = 0),  and  the  other  is  the  Lorenz  signal.  I add 
noise  to  both  signals  and  set  the  SNR's  for  both  signals  equal  to  10  dB.  According  to  the 
estimation  of  the  mutual  information  functions,  as  shown  in  Figure  2-21, 1 select  x equal  to 
3 for  the  van  der  Pol  signal  and  2 for  the  Lorenz  signal  to  reconstruct  trajectories.  The  con- 
struction space  of  the  van  der  Pol  attractor  has  dimension  5,  and  the  construction  space  of 
the  Lorenz  attractor  has  dimension  8.  In  Figure  2-22,  the  residue  powers  are  plotted 
against  the  number  of  the  pricipal  axes  used  to  form  the  projection  space.  We  note  that  the 
drop  rate  of  the  residue  power  for  the  van  der  Pol  signal  becomes  small  after  we  increase 
the  dimension  of  the  projection  space  to  larger  than  2.  Therefore,  the  estimate  of  the 
embedding  dimension  for  the  van  der  Pol  attractor  is  2.  Similarly,  we  can  estimate  the 
embedding  dimension  for  the  Lorenz  attractor  equal  to  4.  The  van  der  Pol  signal  is  peri- 
odic and  has  a limit  cycle  as  its  attractor.  The  Lorenz  signal  has  a chaotic  attractor  whose 
correlation  dimension  is  2.05  [Gra83c].  Therefore,  these  two  estimates  are  pretty  consis- 


tent with  Takens’  theorem. 
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Figure  2-20.  (a)the  van  der  Pol  signal,  (b)  the  van  der  Pol  signal  in  noise, 
and  (c)  their  spectra  (dashed  line  for  the  original  signal  and  dark  line  for 
the  noise  signal),  (c),  (d),  and  (e)  for  the  Lorenz  signal. 


Figure  2-21.  Estimations  of  the  mutual  information  functions  for  (a)  the  van  der  Pol 
signal,  and  (b)  the  Lorenz  signal. 
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Figure  2-22.  Curves  of  residue  power  (8^)  v.s.  dimension  of  the  projection 
space  (r)  for  (a)  the  van  der  Pol  attractor  and  (b)  the  Lorenz  attractor. 

Once  we  determine  the  embedding  space,  we  can  obtain  the  projection  of  the  tra- 
jectory in  this  subspace.  Since  this  projection  will  remove  high-dimensional  noise,  the 
new  trajectory  should  contain  less  noise.  This  projection  operation  can  be  expressed  as 


where  L'  is  the  projection  of  the  trajectory  matrix,  and  q is  the  embedding  dimension.  To 
recover  a time  series  from  a projected  trajectory,  Landa  and  Rosenblum  suggested  to  use 

the  sequence  x (t) , which  is  computed  by 


where  L'  . is  the  entry  of  the  matrix  L'  in  t-row  and  i-th  column,  is  the  first  element  of 
the  i-th  principal  axis.  They  reported  that  this  sequence  can  preserve  the  waveform  of  the 


original  signal  very  well.  This  is  like  a reverse  procedure  of  the  delay  construction 


L = LV 


eq.  2.4. 1-4 


V — " ^ 

y - \p^P2  ...p^ 


eq.  2.4. 1-5 


i=  1 
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method.  In  Figure  2-23, 1 show  the  recovered  time  series  for  the  van  der  Pol  signal  and  the 
Lorenz  signal  from  their  embedding-space  projections.  After  this  process,  the  SNR  can  be 
raised  to  12.48  dB  for  the  van  der  Pol  signal  and  to  lO.SdB  for  the  Lorenz  signal. 

A similar  noise  reduction  technique  has  been  used  in  spectral  analysis  and  is  called 
rank  reduction  approximation[Mar87].  An  example  of  this  approach  can  be  found  when 
we  try  to  reduce  the  order  of  an  FIR  filter.  Given  an  input  data  matrix  A and  the  vector  of 
the  output  desired  signal  b,  we  can  compute  the  coefficient  vector  x by  solving  the  linear 
equation  Ax=b.  We  can  apply  the  same  procedure  mentioned  above  to  reduce  the  rank  of 
the  data  matrix  A.  In  this  approach,  the  data  matrix  plays  the  same  role  as  a trajectory 
matrix.  However,  the  projection  space  does  not  have  the  same  physical  meaning,  and  the 
determination  of  the  projection  dimension  requires  some  prior  knowledge  of  the  noise 
level. 

This  state-space  filtering  enables  us  to  reduce  high-dimensional  noise.  But,  noise 
still  corrupts  the  signal  in  the  embedding  space.  This  low-dimensional  noise  component  is 
analogous  to  the  noise  in  the  pass  band  of  a frequency  filter.  In  the  frequency  domain,  it  is 
extremely  difficult  to  separate  a signal  and  a noise  in  the  same  frequency  band.  On  the 
other  hand,  we  know  that  the  noise  existing  in  the  embedding  space  of  a deterministic  tra- 
jectory will  cause  the  jaggedness  of  the  trajectory.  If  we  can  smooth  the  trajectory  some- 
how this  low-dimensional  noise  can  be  reduced  further.  This  idea  will  be  further  explored 
in  section  4.4  where  I propose  to  use  a trajectory  predictor  to  further  remove  the  noise  in 
the  embedding  space. 
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Figure  2-23.  Time  series  recovered  from  projections  (a)  for  the  van  der  Pol  sig- 
nal and  (b)  for  the  Lorenz  signal. 


2.4.2  Dynamic  Modeling 

According  to  Takens’  Embedding  theorem,  when  the  dimension  of  the  construe- 

tion  space  (2m+l)  is  higher  than  (2d+l),  a map  F:R  ->  7?  exists,  which  governs 
the  evolution  of  the  trajectory  reconstructed  from  a given  signal.  The  mapping  F trans- 

A 

forms  the  current  reconstructed  state  y(t)  to  the  next  state  y(t  + 1) 

1)  = F(>(0)  eq.  2.4.2-1 


x(/+  1) 

f 

x(0 

\ 

x{t  + 2) 

= F 

X (r  + 1) 

• • • 

X (/  + 1 + 2m) 

• • • 

x(t  + 2m) 

J 

t 
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Here,  we  assume  that  the  trajectory  is  reconstructed  by  using  the  delay  coordinate  method 
with  X =1 . We  note  that  the  mapping  actually  includes  several  trivial  filters  and  a predictor. 

I “1“  1 

The  predictor  part  is  what  we  are  interested  in.  This  predictive  mapping,  F :R  —>R, 
can  be  expressed  by 


x(r+  1 +2m)  = F^(x(/),^(^+  \ ) ,...,x(t  + 2m)) 
or  x(/+  1)  = F^(x(0) 


eq.  2. 4.2-2 


T 

where  x (/)  = [x  (?  - 2m)  . . . x (/  - 1 ) x (/)]  • actually  an  autoregressive  model 

of  the  signal.  Since  the  map  is  usually  nonlinear,  eq.  2. 4.2-2  is  a nonlinear  autoregressive 
(NAR)  model.  The  existence  of  this  predictive  model  in  a deterministic  signal  lays  a theo- 
retical basis  for  dynamic  modeling.  In  dynamic  modeling,  we  build  from  the  signal  a 

model  to  approximate  the  mapping  F^.  According  to  eq.  2.4.2-2,  the  resulting  model  can 
be  obtained  by  a one-step  signal  predictor.  This  means  that  given  the  present  and  the  previ- 
ous data  samples  the  model  should  be  able  to  predict  the  next  data  sample. 

However,  we  note  that  the  mapping  F is  the  flow  of  the  reconstructed  dynamics, 

which  is  shown  to  be  equivalent  to  the  original  system  flow  (|)  in  eq.  2.2-6.  In  other  words, 
the  iterates  of  the  mapping  F for  the  same  initial  condition  should  yield  the  entire  recon- 
structed trajectory.  Likewise,  the  iterates  of  the  mapping  F^  should  reproduce  the  original 
signal.  Therefore,  when  building  a dynamic  model  we  also  need  to  consider  the  fitness  of 
the  iterative  map  of  the  model.  This  consideration  distinguishes  my  work  from  most  of  the 
others’ . 

In  modeling  a chaotic  time  series,  most  researchers  just  construct  a one-step  pre- 
dictor for  the  signal  [Cas91][Lap87][Mea91][Wei90].  And,  an  optimal  one-step  predictor 
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is  usually  built  to  minimize  the  following  criterion 

E = 2 dist  (x  (/  + 1 ) - ~F^  {x  (0 ))  eq.  2.4.2-3 

i = 2m  + 1 

where  / is  the  length  of  the  signal,  x (i)  is  the  i-th  data  sample  of  the  signal,  F ( ) is  the 
map  developed  by  the  model  to  approximate  and  dist  ( ) is  a distance  function.  The 
prediction  results  and  their  corresponding  data  samples  can  be  expressed  as 

x (/■  + 1)  = F (x  (0  ) + 5,  eq.  2.4.2-4 

x(i  + 2)  =F  (x(/'+1))+52 


where  6 . is  the  error  in  predicting  the  data  sample  at  time  i+j.  To  simplify  the  discussion, 
let  us  assume  that  a model  is  optimized  only  for  the  predictions  of  x(i+l)  and  x(i+2).  As  a 
results,  both  5.  and  5-  will  be  minimized.  After  we  obtain  the  model,  we  iterate  it  to  pre- 
dict x(i+l)  and  x(i+2).  We  note  that  as  long  as  5j  is  not  equal  to  zero  the  iterate  of  the 
model,  corresponding  to  the  estimate  of  x(i+2),  will  not  be  optimized.  This  is  because  the 
estimate  of  x(i+2)  is  optimized  only  when  we  feed  the  true  values  of  the  past  data  samples 
into  the  model  to  make  the  prediction,  and  in  the  iteration  we  replace  x(i+l)  with 
F (x  (/) ) to  predict  x(i+2).  We  can  generalize  this  discussion  to  any  optimal  one-step 
predictors.  Therefore,  we  can  conclude  that  an  optimal  one-step  predictor,  when  iterated, 
may  not  be  able  to  produce  a sequence  that  fits  the  original  data  samples  best.  In  other 
words,  building  a model  as  a one-step  predictor  is  not  sufficient  to  constrain  the  iterative 
map  of  the  model. 

To  consider  the  fitness  of  the  iterative  map  of  a model,  I propose  to  use  the  follow- 


ing optimization  criterion 
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/ 

E=  ^ dist (x  (i  + \)  - X (i  + \)) 

i = 2m  + 1 


eq.  2. 4.2-5 


x(/-i-l)  =F  (Jc(/-2w),...^(/- 1)^(/)) 

r^o) 


\ <j  < 2m  + 1 
j > 2m  + 1 


where  x(i)  is  the  iterate  of  the  model.  According  to  eq.  2.4. 2-1,  we  may  consider  x(i)  as 

O 1 

an  estimate  of  the  iterative  map  [F*  (x(2«/ + 1))]  , where  i.  denotes  the  pre- 


dictive part  of  the  mapping.  Minimizing  this  criterion,  we  can  obtain  an  optimal  multi-step 
predictor.  To  compare  with  a one-step  predictor,  let  us  assume  that  we  build  an  optimal 
iterative  two-step  predictor.  We  have 


x(;  + 1) 
X (/  4-  2) 


~1 


= F (x(/))  +5j  = x(/  + 1)  +5j 


~± 


= F (x(i-2m+  l),...pr(/-  l)pi:(/),x(;-l- 1))  +5, 


f~2  . V 

^F(x(0)J  +6, 


eq.  2.4. 2-6 


When  we  minimize  6,  and  5-,  we  not  only  try  to  construct  an  optimal  predictive  model 
but  also  try  to  constrain  the  iterative  map  of  the  model.  In  this  case,  the  resulting  model 
can  yield  the  predictions  of  the  next  two  data  samples,  by  iteration,  that  fit  their  original 
values  best.  For  a general  case,  the  number  of  the  constraints  that  we  impose  on  the  itera- 
tive map  of  a model  is  equal  to  the  length  of  the  sequence  (/  - 2m)  in  the  criterion  of 
eq. 2.4. 2-5.  Thus,  minimizing  the  criterion  of  eq.  2.4. 2-5,  we  can  build  an  optimal  predic- 
tive model  whose  iterative  map  fits  the  dynamics  of  the  signal  best.  Therefore,  we  may 
conclude  that  in  dynamic  modeling  the  optimality  criterion  for  multi-step  predictors  is  bet- 
ter than  the  optimality  criterion  for  one-step  predictors. 
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In  order  to  compute  the  parameters  of  an  optimal  model  for  the  criterion  given  in 
eq.  2.4. 2-3  or  eq.  2. 4.2-5,  we  can  solve  the  following  equations 

dE 

- — = 0 i - \,2,  ...,n  eq.  2.4. 2-7 

ow . 

I 

where  w . is  the  i-th  parameter,  and  n is  the  total  number  of  the  parameters.  Since  the 
model  F is  nonlinear  and  E is  a function  of  compositions  of  F , it  may  be  very  difficult 
to  solve  the  equations  directly  for  a given  signal.  Usually,  an  adaptation  scheme  is  pre- 
ferred in  this  optimization  problem.  In  this  work,  I use  adaptive  algorithms  to  train 
dynamic  models  based  on  artificial  neural  networks. 

In  optimizing  the  parameters  of  a model,  we  prepare  several  training  sequences 
from  different  data  segments  of  the  signal.  The  goal  is  to  avoid  the  resulting  model  to 
become  just  an  interpolator  of  a signal  segment.  In  Figure  2-24, 1 give  an  example  of  how 
to  prepare  training  sequences  from  a time  series.  We  note  that  two  consecutive  training 
sequences  can  overlap.  Therefore,  the  criterion  in  eq.  2.4. 2-3  will  be  modified  as 

r I 

F = ^ ^ dist{x{i +j  • q+ \) -x{i +j  • q->r  \))  eq.  2.4.2-S 

j = Oi  = 2m  4- 1 

where  r+1  is  the  total  number  of  the  training  sequences,  and  q is  the  distance  of  the  initial 
points  between  two  consecutive  training  sequences.  Since  each  training  sequence  corre- 
sponds to  a segment  of  a reconstructed  trajectory,  a training  procedure  to  minimize  this 
criterion  is  called  trajectory  learning  [Her91]. 
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training  sequence  i 

training  sequence  i+1 


Figure  2-24.  Preparing  training  sequences  from  a signal. 


In  this  criterion,  the  longer  the  training  sequence,  the  more  constraints  we  will 
impose  on  the  iterative  map  of  a model.  In  nonlinear  optimization,  we  will  decrease  the 
possibility  of  obtaining  a sub-optimal  solution  by  doing  this.  However,  the  complexity  of 
the  adaptation  algorithms  to  minimize  this  criterion  usually  increases  dramatically  with 
the  length  of  the  training  sequences.  In  addition,  we  found  that  training  a model  of  a cha- 
otic signal  with  long  training  sequences  may  bring  instability  to  the  adaptation.  In  the  next 
section,  I will  propose  several  methods  to  determine  the  length  of  the  training  sequences 
according  to  the  measurement  of  the  dynamical  invariants  of  the  signal. 

2.5  Dynamical  Invariants  in  Modeling 
2.5.1  Attractor  Dimension  and  Model  Order 

According  to  Takens’  Embedding  theorem,  the  reconstruction  dimension  must  be 
at  least  twice  the  dimension  of  the  underlying  system  attractor.  This  condition  guarantees 
that  there  is  no  crossing  of  a trajectory  in  reconstruction.  Since  a crossing  point  on  the  tra- 
jectory implies  a one  to  many  mapping,  the  existence  of  the  crossing  points  may  result  in 
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modeling  ambiguities.  In  eq.  2.4.2-2,  we  note  that  the  reconstruction  dimension  is  equiva- 
lent to  the  order  of  the  dynamic  model  which  gives  the  number  of  data  samples  in  the  past. 
For  instance,  in  the  next  chapter  we  will  propose  using  artificial  neural  networks  in 
dynamic  modeling.  And,  the  model  orders  of  the  architectures,  shown  in  Figure  3-4  and  3- 
5,  are  just  the  number  of  the  delay  units  in  the  input  layer.  Therefore,  Takens’  theorem 
already  gives  us  a lower  bound  for  the  model  order.  However,  we  may  find  that  a model 
whose  order  is  estimated  according  to  Takens’  theorem  may  perform  badly,  i.e.  the  low 
bound  must  be  refined. 

In  predicting  a chaotic  time  series.  Mead  et  al.  had  studied  the  relationship 
between  the  performance  of  a model  and  the  input  time  span  [Mea91].  The  input  time  span 
is  defined  as 

w — At  • p eq.  2.5. 1-1 


where  At  is  the  sampling  period  of  the  signal,  and  p is  the  order  of  the  model.  They 

reported  that  the  order  of  the  model  can  not  be  determined  solely  by  Takens’  theorem  as 
suggested  above.  We  also  need  to  take  into  account  the  sampling  rate  of  the  signal.  In  their 
work,  the  signal  was  produced  by  integrating  the  Mackey-Glass  equation  [Mac77].  The 
equation  is  a time-delay  differential  equation,  which  was  first  proposed  as  a model  of 
white  blood  cell  production.  The  equation  is  given  by 


ax(t  -D) 

1 +/(/-£)) 


- bx(t) 


eq.  2.5. 1-2 


where  a,  b,  and  c are  constant  coefficients,  and  D is  a time  delay.  The  following  coeffi- 
cients were  selected:  a=0.2,  b=0. 1,  c=10,  and  D=30.  The  solution  is  a chaotic  signal.  I will 
refer  to  this  signal  as  mg30  hereafter.  Mead  et  al.  trained  an  artificial  neural  network  as  a 
predictor  of  the  signal.  Their  results  show  that  the  input  time  span  of  a successful  model  is 
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always  larger  than  D (30),  or  the  delay  of  the  underlying  system.  Accordingly,  the  order  of 
the  model  can  be  reliably  determined  if  the  equation  is  given.  However,  in  most  real-life 
problems  the  equations  of  motion  are  unknown. 

In  this  work,  I propose  to  determine  a refined  lower  bound  for  the  model  order 
based  on  the  CIM  (Correlation  Integral  Map)  curves.  In  the  computation  of  the  correlation 
dimension,  the  saturation  of  the  slope  of  the  CIM  curves  depends  on  the  dimension  of  the 
reconstruction  space.  When  the  saturation  occurs,  we  may  say  the  reconstruction  is  topo- 
logically equivalent  to  the  original  system  dynamics  because  the  attractors  are  of  the  same 
dimension.  Therefore,  the  smallest  dimension  of  the  reconstruction  space  in  which  the  sat- 
uration occurs  can  serve  as  a refined  lower  bound  estimate  of  the  order.  Although  being 
qualitative,  this  criterion  is  useful  as  we  are  going  to  show  next.  We  note  that  in  the  formu- 
lation of  the  dynamic  modeling  (see  eq.  2. 4.2-2)  we  set  x equal  to  1.  With  this  selection, 
the  saturation  of  the  slope  of  the  CIM  curves  usually  occurs  in  a higher-dimensional  space. 
This  is  because  the  elements  of  the  construction  states  are  highly  correlated. 

To  verify  this  method,  I compute  the  CIM  curves  for  two  time  series  sampled  from 
the  Mackey-Glass  signal  at  different  rates;  one  at  1/3  Hz,  and  the  other  at  1/6  Hz.  The 
results  are  shown  in  Figure  2-25  and  Figure  2-26.  We  note  that  the  slope  of  the  CIM 
curves  for  the  fast-sampled  signal  saturates  after  we  increase  the  construction  dimension 
over  9,  and  for  the  slow-sampled  signal  it  saturates  after  the  construction  dimension  is 
over  5.  Therefore,  the  refined  lower  bound  estimate  of  the  model  order  is  9 for  the  first 
time  series,  and  5 for  the  other.  The  input  time  spans,  w,  for  both  cases  are  very  close  to 
30,  the  delay  D in  the  eq.  2.5. 1-2.  These  results  are  consistent  with  the  conclusion  given  by 
Mead  et  al.  The  advantage  of  this  estimation  method  is  that  we  do  not  need  any  a priori 
knowledge  about  the  system  equation. 
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Figure  2-25.  (a)  CIM  curves  and  (b)  their  slope  estimates  for  the  mg30  signal, 
(sampling  rate:  1/3  Hz,  reconstruction  dimensions:  2-12) 


Figure  2-26.  (a)  CIM  curves  and  (b)  their  slope  estimates  for  the  mg30  signal, 
(sampling  rate:  1/6  Hz,  reconstruction  dimensions:  2-12) 


2.5.2  Maximal  Lyapunov  Exponent  and  Length  of  Training  Sequences 

The  length  of  the  training  sequences,  /',  is  equivalent  to  the  number  of  constraints 
that  we  impose  on  the  iterative  map  of  a dynamic  model.  In  principle,  the  model  will 
approximate  the  dynamics  of  a signal  better  if  we  use  long  training  sequences.  However, 
in  practice,  the  computation  of  the  model  parameters  may  require  tremendous  computer 
resources  if  the  training  sequences  are  too  long.  In  addition,  when  modeling  chaotic  time 
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series  long  training  sequences  may  cause  the  stability  problem  (or  the  convergence  prob- 
lem in  the  literature  of  adaptive  signal  processing  [Hon84]). 

From  the  point  of  view  of  dynamic  modeling,  training  sequences  should  contain 
enough  information  about  the  global  picture  of  an  attractor  to  be  modeled.  Therefore,  in 
this  research,  the  length  of  the  training  sequences  is  required  to  be  not  shorter  than  the 
reciprocal  of  the  median  frequency  of  a given  signal.  This  threshold  is  actually  an  estimate 
of  the  orbit  time  of  the  signal  dynamics,  which  is  the  average  time  span  required  for  a 
point  to  return  to  the  same  neighborhood  in  the  attractor.  Therefore,  a training  sequence 
containing  the  global  information  of  the  attractor  should  be  longer  than  this  time  span. 

In  this  section,  I will  propose  different  criteria  to  determine  the  length  of  the  train- 
ing sequences  for  non-chaotic  signals  and  chaotic  signals.  For  those  chaotic  signals  with 
large  positive  Lyapunov  exponents,  the  estimate  of  /'  may  be  much  shorter  than  their  orbit 
time  estimates,  and  we  can  hardly  impose  any  constraint  on  the  iterative  map  of  the  model. 
For  this  case,  I also  propose  a two-stage  training  procedure  to  incorporate  the  constraints. 

2.5.2. 1 Non-chaotic  signals 

For  non-chaotic  signals,  the  selection  of  the  sequence  length  is  relatively  easy. 
This  is  because  a non-chaotic  dynamics  has  only  non-positive  Lyapunov  exponents  and  all 
of  the  system  trajectories  approaching  the  same  attractor  will  get  closer  as  time  evolves. 
Eventually,  all  of  these  trajectories  will  become  indistinguishable.  Therefore,  if  a model 
already  captures  the  reconstructed  dynamics  of  a signal,  the  output  of  the  model,  in  steady 
state,  should  converge  to  the  signal.  In  other  words,  there  will  be  no  convergence  problem 
in  training  due  to  the  selection  of  a large  /' . Therefore,  the  consideration  in  determining  /' 
for  non-chaotic  dynamics  is  that  the  training  sequences  should  provide  enough  informa- 
tion about  the  system  dynamics  and  result  in  an  optimization  problem  of  a moderate  size. 

For  a limit  cycle,  a well-defined  orbit  time  exists,  and  the  waveform  of  the  signal 
will  repeat  itself  every  orbit  time.  A training  sequence  with  a length  equal  to  one  orbit  time 
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contains  the  entire  attractor  dynamics.  Therefore,  we  can  select  /'  equal  to  the  number  of 
the  data  samples  that  covers  one  orbit  of  the  attractor.  For  instance,  to  build  a dynamic 
model  for  a limit  cycle  we  can  select  /'  equal  to  at  least  one  period  of  the  signal,  which  is 
equivalent  to  the  orbit  time  of  the  limit  cycle. 

Of  course,  we  can  choose  /'  longer  than  one  orbit  time.  However,  the  convergence 
speed  of  the  training  will  slow  down  due  to  the  existence  of  more  constraints.  And,  we 
will  increase  the  computation  time  and  computer  memory  requirements. 

2 5.2.2  Chaotic  signals  with  small  positive  Lyapunov  exponents 

For  chaotic  signals,  the  problem  is  very  different.  Besides  the  consideration  of  the 
computation  complexity,  the  selection  of  /'  should  avoid  oscillations  in  training.  When  the 
parameters  of  a model  oscillate  during  training,  they  will  never  converge  to  a solution.  In 
modeling  chaotic  signals,  there  are  two  sources  which  may  cause  the  stability  problem. 
Both  can  be  related  to  the  existence  of  the  positive  Lyapunov  exponents. 

The  first  source  of  the  oscillation  can  be  described  as  follows:  assume  a model 
starts  capturing  the  dynamics  of  a given  signal.  In  other  words,  the  iterates  of  the  model 
and  the  signal  can  be  treated  as  two  trajectories  in  the  same  attractor.  Due  to  the  error  in 
predicting  the  first  and  subsequent  data  samples,  these  two  trajectories  are  considered  to 
start  from  two  neighboring  initial  conditions.  The  existence  of  the  positive  Lyapunov 
exponents  will  make  these  two  nearby  trajectories  diverge  from  each  other.  In  other  words, 
the  divergence  between  the  signal  and  the  model  output  are  caused  by  the  system  dynam- 
ics. An  example  to  show  this  divergence  is  given  in  Figure  2-27.  I plot  two  segments  of 
the  mg30  signal,  which  start  from  almost  the  same  initial  condition.  After  60  data  samples, 
both  segments  diverge  from  each  other  very  fast.  This  example  suggests  that  even  an  exact 
model  can  not  reproduce  the  same  signal.  Therefore,  any  attempt  to  minimize  this  diver- 
gence will  not  improve,  but  will  deteriorate  the  modeling  accuracy.  As  a result,  this  effort 
will  result  in  a convergence  problem.  According  to  this  reasoning,  we  may  conclude  that 
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we  should  avoid  using  long  training  sequences  in  modeling  a chaotic  signal,  and  the  errors 
in  the  later  iterations  of  the  model  should  be  weighted  less  in  the  optimization  criterion. 
Therefore,  the  criterion  in  eq.  2A.2-8  is  modified  as  follows: 

r / 

Z E h (/)  • dist  (x  (i  +j  • q + \)  - X (i  +j  • q + 1))  eq.  2. 5.2. 2-1 

j - 0/  = 2m  + 1 

where  h (/)  is  the  weight  that  we  put  on  the  error  (distance)  for  the  i-th  iteration.  The 
choice  of  the  weighting  function  should  depend  on  the  positive  Lyapunov  exponent  of  the 
system  to  be  modeled.  We  will  discuss  this  topic  again  at  the  end  of  this  section  after  we 
formulate  the  divergence  between  the  output  of  a model  and  a given  signal. 


Figure  2-27.  Two  segments  of  the  mg30  signal. 

The  second  source  of  the  oscillation  arises  when  a model  is  trained  to  learn  two 
sequences  whose  initial  conditions  are  close  to  each  other.  In  state  space,  these  two 
sequences  are  equivalent  to  two  nearby  segments,  say  segment  A and  B,  of  the  trajectory 
reconstructed  from  the  signal  to  be  modeled.  When  the  model  starts  capturing  the  dynam- 
ics, as  we  discussed  above,  the  iterates  of  the  model  corresponding  to  these  two  training 
sequences  can  be  used  to  reconstruct  two  trajectories  around  segment  A and  segment  B 
respectively.  This  is  illustrated  in  Figure  2-28.  Instead  of  specifying  the  trajectories  recon- 
structed from  the  outputs  of  the  model,  two  uncertainty  regions  are  delimited.  They  repre- 
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sent  the  possible  divergence  range  between  the  model  output  and  the  signal  due  to  the 
existence  of  the  positive  Lyapunov  exponents.  As  time  evolves,  both  uncertainty  regions 
will  grow  and  eventually  overlap.  When  this  overlap  occurs,  the  training  may  become 
unstable.  This  is  because  if  the  output  of  the  model  falls  into  the  overlap  region  the  model 
is  required  to  develop  a map  to  follow  the  evolution  of  both  segment  A and  segment  B 
during  training.  Since  segment  A and  segment  B will  diverge  from  each  other  eventually, 
it  is  not  possible  for  the  model  to  meet  this  conflicting  requirement. 

To  analyze  after  how  many  iterations  in  the  average  we  can  expect  this  overlap  to 
occur,  let  us  first  quantify  the  divergence  of  the  model  output  and  the  signal  due  to  the  pos- 
itive Lyapunov  exponents  (we  assume  the  model  is  already  close  to  the  underlying  sys- 
tem.). As  we  mentioned  in  section  2.3.2,  the  divergence  of  two  nearby  trajectories  is 
mainly  determined  by  the  largest  Lyapunov  exponent.  And,  the  length  of  the  first  principal 
axis  of  the  uncertain  region  around  the  signal  trajectory  at  i-th  iteration  can  be  estimated 
by 

X iAt  , _ 

£.  = e^e  i = 1,2,...  eq.  2. 5. 2. 2-2 

where  8 . is  the  length  of  the  principal  axis  of  the  uncertain  region  at  i-th  iteration,  8q  is  the 
initial  separation,  X is  the  largest  Lyapunov  exponent,  and  At  is  the  sampling  period 
of  the  signal.  In  this  formulation,  we  will  use  the  average  one-step  prediction  error  as  the 
initial  separation  of  the  signal  and  the  model  output.  To  obtain  an  estimate  of  8q,  we  can 
train  another  model  as  a one-step  predictor.  And,  the  largest  Lyapunov  exponent  can  be 
estimated  by  using  the  algorithm  introduced  in  section  2.3.2. 


70 


uncertainty  regions 
at  i-th  iteration 


gment  A 


8 


0 


segment  B 


Figure  2-28.  State  space  representation  in  training  a model 
with  two  sequences  whose  initial  conditions  are  close. 


According  to  eq.  2. 5. 2. 2-2,  we  can  estimate  the  smallest  number  of  iterations, 


where  q.  is  the  distance  between  the  / -th  points  on  both  training  segments.  This  estima- 

s 

tion  is  based  on  the  assumption  that  the  first  principal  axes  of  both  uncertain  regions  are  in 
the  same  line  but  in  opposite  directions.  If  after  iterations  the  distance  between  the  seg- 
ment A and  the  segment  B are  equal  to  or  shorter  than  2e . , the  uncertainty  regions  will 

s 

overlap.  To  avoid  training  the  model  with  a conflicting  requirement,  both  training 
sequences  should  be  smaller  than  Therefore,  I propose  to  use  the  average  of  /^’s  com- 
puted from  all  pairs  of  neighboring  training  sequences  as  an  estimate  of  /' . This  estimate 
can  be  computed  by 


before  the  overlap  occurs  from  the  following  inequality 


eq.  2.5.2. 2-3 


r 


/'  = - y / 


eq.  2. 5.2. 2-4 


71 


where  r is  the  number  of  the  training  sequences,  7 is  the  estimate  of  for  the  j-th  pair  of 

S S 

nearby  training  sequences. 

Using  this  method  to  estimate  the  length  of  the  training  sequences,  we  need  to 
obtain  three  measurements  first:  average  one-step  prediction  error,  the  largest  Lyapunov 
exponent,  and  / ’s.  Usually,  the  computation  time  required  to  obtain  these  estimates  is 
much  less  than  training  of  a dynamic  model  by  choosing  /'  by  trial  and  error.  Therefore, 
this  method  can  really  reduce  or  eliminate  a time-consuming  trial -and-error  process  in 
determining  a proper  length  of  the  training  sequences. 

In  eq.  2. 5. 2. 2-2,  we  usually  use  the  1-2  norm  to  measure  the  distance  between 
points.  If  the  norm  is  used  as  distance  measure,  eq.  2. 5. 2. 2-2  can  be  approximated  by 


, X iAt 

Ct  ■ — SqC 


/ = 1,2, ... 


eq.  2. 5. 2.2-5 


where  d.  is  the  error  between  i-th  iterate  of  the  model  and  the  corresponding  data  sample 
of  the  signal.  To  derive  this  approximation,  let  us  assume  we  are  measuring  the  distance  of 
two  reconstructed  state  vectors. 


y(0  - x(t-2m)  ...  x(/- 1)  x(o] 


eq.  2. 5. 2. 2-6 


>'(0  = 


p ) 


i - 2m 


J 


where  x(t)  is  the  data  sample  of  the  signal  at  time  t,  and  yF  j is  the  i-th  iterate  of  the 


model  corresponding  to  x(t).  Since  the  error,  in  the  average,  will  grow  with  iteration,  the 
distance  between  these  two  vector  in  terms  of  the  norm  can  be  approximated  by 
x(0-  \F  J 1.  This  approximation  yields  eq.  2. 5. 2. 2-5.  This  formula  to  estimate  the 
divergence  of  the  signal  and  the  output  of  the  model  in  the  time  domain  was  first  proposed 
by  Casdagli  [Casdagli].  He  showed  numerically  that  this  formula  is  very  consistent  with 
the  performance  of  an  accurate  model.  I will  discuss  this  formula  again  in  section  2.5.3  for 


model  validation. 


72 


In  eq.  2.5.2.2-1, 1 proposed  to  assign  different  credits  to  the  errors  at  different  iter- 
ations. According  to  eq.  2.5.2.2-S,  we  can  design  a weighting  function  as  follows 


X At^ 

h(i)  = (e-  ) 


-(/-  (2m  + 1)) 


eq.  2. 5. 2.2-7 


The  errors  in  the  later  iterations  will  be  weighted  less  in  the  optimization  criterion.  With 
this  selection,  we  can  rewrite  eq.  2.5. 2. 2-1  as 


Z Z 

= Oi  = 2m  + 1 


y At  ~U  ~ + 1)  ) 

- ) dlsHx(i)-yAi))  eq.2.5.2.2-8 


max 


In  Chapter  3,  we  will  discuss  several  gradient-descent  adaptive  algorithms  that  can  be 
used  to  minimize  this  cost  function. 


2.5. 2.3  Chaotic  signals  with  large  positive  Lyapunov  exponents 

For  some  chaotic  signals,  their  positive  Lyapunov  exponents  can  be  so  large  that 
the  uncertain  region  around  each  training  sequence  grows  very  fast.  Consequently,  the 
estimate  of  the  length  of  the  training  sequences  may  become  very  short,  and  we  can  hardly 
impose  any  constraints  on  the  iterative  map  of  a model.  For  this  case,  I propose  a two- 
stage  training  procedure  which  can  incorporate  these  constraints  into  the  training. 

In  the  first  stage  of  the  training,  we  do  not  impose  any  constraint  on  the  iterative 
map  of  the  model.  In  other  words,  the  model  is  trained  as  a one-step  predictor.  This  is 
because  if  we  directly  train  a model  with  sequences  prepared  from  a signal  whose  positive 
Lyapunov  exponents  are  very  large,  we  may  find  that  the  required  mapping  embedded  in 
these  sequences  can  become  very  difficult  to  develop.  This  problem  can  be  illustrated  by 
an  example.  In  Figure  2-29, 1 show  several  data  segments  of  the  Lorenz  signal  (with  o=10, 
r=28,  and  b=8/3).  These  segments  all  start  from  nearby  initial  conditions.  Suppose  we  pre- 
pare training  sequences  from  these  data  segments  as  follows:  the  first  eight  data  samples 
of  each  segment  will  be  loaded  as  an  initial  condition,  and  the  rest  of  the  segment  is  the 


73 


target  sequence  that  the  model  is  trained  to  produce.  In  Figure  2-29,  we  find  that  the  target 
sequences  can  be  very  different  even  though  their  initial  conditions  are  close.  This  diver- 
sity of  possible  target  sequences  will  cause  the  instability  problem  during  training.  There- 
fore, if  we  train  a model  with  one  of  these  sequences  directly,  the  model  may  never 
develop  an  approximating  mapping  at  all. 


Figure  2-29.  Divergence  of  data  segments  of  the  Lorenz  signal. 


The  second  stage  of  the  training  starts  when  the  average  one-step  prediction  error 
becomes  smaller  than  a preselected  error  threshold.  At  this  stage,  we  will  train  the  model 
to  produce  a target  sequence  for  a given  initial  condition.  However,  the  target  sequence  is 
not  necessary  the  segment  that  directly  follows  the  initial  condition.  We  load  the  model 
with  an  initial  condition,  and  iterate  the  model  to  yield  a sequence  of  certain  length.  Then, 
we  compare  this  sequence  with  several  target  sequences  whose  initial  conditions  are 
located  in  the  same  neighborhood,  and  choose  the  one  that  is  closest  to  the  model  iterates 
as  the  target  sequence  for  the  model  to  produce.  In  this  manner,  we  are  able  to  impose 
some  constraints  on  the  iterative  map  of  the  model. 

In  this  procedure,  the  model  is  trained  to  learn  the  local  behavior  in  the  first  stage, 
and  refine  its  iterative  map  in  the  second  stage.  We  note  that  the  model  is  not  required  to 
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reproduce  the  sequence  following  directly  a given  initial  condition.  The  target  output 
sequence  is  determined  after  we  compare  the  iterates  of  the  model  with  several  candidate 
target  sequences.  Therefore,  I will  call  this  training  selectable-target-sequence  training  io 
distinguish  it  from  a normal  sequence  training  as  we  mentioned  in  the  previous  section. 
Let  us  use  the  example  given  in  Figure  2-28  to  compare  this  training  with  a normal 
sequence  training.  For  a normal  sequence  training,  the  model  is  required  to  learn  two  dif- 
ferent sequences  (segment  A and  B)  even  when  those  two  sequences  have  initial  condi- 
tions in  the  same  neighborhood.  On  the  contrary,  in  an  selectable-target-sequence  training, 
the  model  is  trained  to  produce  either  segment  A or  segment  B,  but  not  both,  when  those 
two  segments  have  initial  conditions  in  the  same  neighborhood. 

In  this  two-stage  training  procedure,  there  are  three  parameters  that  we  need  to 
determine:  the  error  threshold,  the  length  of  the  target  sequences,  and  the  number  of  candi- 
date target  sequences. 

When  the  average  one-step  prediction  error  is  smaller  than  the  error  threshold,  the 
training  of  the  model  will  be  switched  from  a one-step  prediction  training  to  an  selectable- 
target-sequence  training.  During  training,  we  usually  plot  a learning  curve  to  monitor  the 
learning  of  the  model.  This  curve  represents  the  performance  of  the  model  in  one-step  pre- 
diction in  each  training  epoch  after  we  present  all  of  the  training  sequences  to  the  model 
once  and  update  the  model  parameters.  When  this  switch  occurs,  we  usually  observe  some 
oscillations  in  the  learning  curve.  These  oscillations  may  indicate  that  in  the  one-step  pre- 
diction training  the  mapping  developed  in  the  model  can  be  a solution  of  a local  minimum, 
and  when  we  impose  the  constraints  on  the  iterative  map  of  the  model  the  adaptation  will 
try  to  escape  from  this  local  minimum.  If  we  choose  a small  error  threshold,  the  oscilla- 
tions may  last  for  a long  time  in  the  learning  curve.  On  the  other  hand,  if  we  choose  a too 
large  error  threshold  the  model  may  not  be  able  to  learn  the  local  behavior  before  we 
switch  to  the  selectable-target-sequence  training.  Consequently,  the  training  will  converge 
very  slowly.  I still  can  not  devise  a systematic  approach  to  determine  this  error  threshold. 


75 


However,  based  on  my  experimental  results,  I usually  choose  the  error  threshold  twice  the 
average  error  of  a one-step  predictor. 

In  an  selectable-target-sequence  training,  we  also  need  to  determine  the  length  of 
the  training  sequences.  As  a rule  of  thumb,  we  may  choose  the  length  equal  to  the  order  of 
the  model.  In  the  case  which  requires  longer  training  sequences  to  provide  enough  global 
information,  we  can  apply  the  following  procedure:  for  a given  initial  condition,  we  iterate 
the  model  k • /'  times,  where  k is  a constant.  We  decide  the  first  target  output  sequence  of 
length  /'  by  comparing  the  first  /'  iterates  of  the  model  with  those  candidate  target 
sequences  whose  initial  conditions  are  in  the  same  neighborhood.  The  target  output 
sequence  for  the  next  /'  iterates  of  the  model  can  be  determined  in  the  same  manner.  But 
now  we  use  the  first  target  output  sequence  as  an  initial  condition  and  search  candidate  tar- 
get sequences.  Again,  we  compare  the  iterates  of  the  model  with  these  candidate  target 
sequences  to  determine  the  second  target  output  sequence.  We  can  continue  this  procedure 
until  we  find  k target  output  sequences.  Then,  we  cascade  these  target  sequences  together 
to  compose  a long  training  sequence.  In  this  research,  I have  not  further  explored  this  idea 
yet. 

When  we  increase  the  number  of  candidate  target  sequences,  the  actual  target  out- 
put may  change  frequently  among  different  sequences  when  the  same  initial  condition  is 
loaded  to  the  model.  This  may  cause  also  some  instability.  Therefore,  I propose  to  prepare 
only  two  candidate  target  sequences  for  each  initial  condition. 

Although  a systematic  approach  to  determine  these  three  parameters  is  still  under 
investigation,  the  experimental  results,  which  will  be  given  in  Chapter  4,  show  that  when 
we  model  a chaotic  signal  with  large  positive  Lyapunov  exponents  this  method  can  really 
perform  much  better  in  capturing  the  signal  dynamics  than  a one-step  predictor.  We  suc- 
ceeded for  the  first  time  to  model  the  Lorenz  attractor  using  a predictive  ANN  model. 
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2 5 3 Model  Validation  Using  Dynamical  Invariants 

For  non-chaotic  signals,  their  dynamic  models  can  be  validated  in  the  time  domain 
or  in  the  frequency  domain.  In  the  time  domain,  the  iterates  of  the  model  is  compared  with 
the  original  signal  sample  by  sample.  For  a successful  model,  the  output  of  the  model 
should  be  very  close  to  the  original  signal  for  a specified  period  of  time.  In  the  frequency 
domain,  we  compare  the  spectra  of  both  signals.  The  locations  and  the  amplitudes  of  the 
fundamental  frequencies  for  the  original  signal  and  the  model  output  should  be  identical. 

On  the  other  hand,  the  validation  of  a chaotic  dynamic  model  is  not  so  straightfor- 
ward. In  the  frequency  domain,  a chaotic  time  series  has  a wideband  spectrum.  It  becomes 
very  difficult  to  identify  the  fundamental  frequencies.  Another  difficulty  arises  due  to  the 
positive  Lyapunov  exponents  possessed  by  chaotic  systems.  If  a model  captures  the  recon- 
structed dynamics,  the  model  itself  is  a chaotic  system.  A prediction  error  of  the  model  (a 
dynamic  model  is  also  a predictive  model  of  the  signal)  will  be  amplified  in  the  later  itera- 
tions. Given  an  initial  condition,  the  iterates  of  the  model  will  eventually  diverge  from  the 
original  signal.  Therefore,  a sample-by-sample  comparison  can  not  be  used  in  the  model 
validation  for  chaotic  time  series. 

However,  according  to  eq.  2. 5. 2. 2-5,  we  can  compute  a reference  curve  for  the 

divergence  of  the  output  of  a model  and  the  original  signal.  This  curve  will  be  called  Cas- 

dagli’s  conjecture  curve.  Therefore,  in  the  time  domain,  we  can  validate  the  model  by 

comparing  the  curve  of  the  average  multi-step  prediction  errors  of  the  model  and  Casdag- 

li’s  conjecture  curve.  For  a successful  model,  both  curves  should  be  close.  The  average 

multi-step  prediction  error  is  computed  by 

F 

N-k 

C-xW  = ^ y [x{iFk)-x{i  + k)]~ 

F a^{N-k)  , 

/ = 1 


eq.  2. 5. 3-1 
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2 

where  k is  the  prediction  step,  a is  the  variance  of  the  original  signal,  N is  the  total  num- 
ber of  data  samples  in  the  signal,  x{i  + k)  is  a k-step-ahead  iterative  prediction  of  the 

model  F . 

This  multi-step  prediction  error  criterion  has  its  practical  importance,  but  it  still 
belongs  to  the  category  of  the  sample-by-sample  comparison  (even  although  we  compute 
the  average).  Using  this  method,  we  can  only  examine  the  local  approximation  of  the 
model.  In  this  work,  I propose  that  the  model  validation  should  include  the  comparison  of 
the  measurements  of  the  dynamical  invariants.  If  a model  really  captures  the  underlying 
dynamics  of  a chaotic  time  series,  a trajectory  reconstructed  from  the  iterates  of  the  model 
should  have  the  same  dimension  and  Lyapunov  exponents  as  the  original  system  attractor. 
To  summarize,  besides  using  visual  inspection  of  the  waveforms  and  the  spectra,  the 
model  validation  for  chaotic  time  series  should  include  the  comparison  of  the  average 
multi  step  prediction  errors  and  the  measurements  of  the  dynamical  invariants. 


CHAPTER  3 

ARTIFICIAL  NEURAL  NETWORKS  IN  DYNAMIC  MODELING 


3 1 Introduction 

In  Chapter  2,  a dynamic  model  has  been  formulated  as  a predictive  model  of  a 
given  signal.  The  iterates  of  the  model  are  required  to  preserve  the  same  dynamical  prop- 
erties of  the  underlying  system  that  generated  the  signal.  As  discussed  in  section  2.4.2,  this 
model  can  be  represented  by 

jc(/+l)  = i^(x(0)  eq.  3.1-1 


nT 


x(t-2m)  ...  x(t-  1)  x(t) 


where  x(t)  is  the  data  sample  of  the  signal.  Therefore,  the  task  in  dynamic  modeling  is  to 
approximate  the  function  subject  to  some  constraints  imposed  on  its  compositions. 
Since  the  signal  is  assumed  to  be  the  output  of  a nonlinear  system,  a nonlinear  modeling 
tool  is  definitely  required. 

There  are  only  a handful  of  nonlinear  modeling  tools  whose  function  approxima- 
tion capabilities  have  been  intensively  studied.  Polynomial  approximation  is  among  them 
and  has  been  a popular  choice.  However,  in  polynomial  approximation  we  need  to  choose 
a set  of  bases  (monomials)  first.  The  coefficients  of  these  monomials  are  then  computed 
such  that  the  approximation  error  is  minimized.  The  performance  of  this  approximation 
usually  relies  heavily  on  the  choice  of  the  bases.  This  problem  (the  dependence  of  the  per- 
formance on  the  selection  of  the  bases)  has  been  shared  by  several  nonlinear  approxima- 
tion methods. 
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Artificial  Neural  Networks  (ANNs)  are  an  alternative  nonlinear  modeling  tool,  and 
they  have  been  applied  successfully  to  the  function  approximation  problem  [Lap87] 
[Jon90b][Rum86][Hus93],  One  of  the  most  important  properties  of  ANNs,  which  distin- 
guishes them  from  the  polynomial  approximation  or  its  variations,  is  its  plasticity.  This 
means  that  an  ANN  can  develop  the  necessary  representations  of  a mapping  through 
learning.  Thus,  the  problem  of  choosing  proper  bases  does  not  exist  in  the  ANN  approach. 

The  study  of  artificial  neural  networks  was  inspired  by  their  biological  counterpart 
[McC43].  A neural  network  usually  consists  of  many  processing  elements  (PEs),  each  of 
which  implements  a simple,  identical  nonlinear  activation  function  on  the  weighting  sum 
of  its  input  signals.  These  PEs  are  connected  to  one  another  in  a highly  parallel  structure. 
The  connection  strength  between  each  pair  of  PEs  can  be  determined  through  training. 

One  of  the  most  popular  network  stmctures  is  to  organize  processing  elements  into 
layers  with  feedforward  connections  only  between  one  layer  and  the  next.  In  this  architec- 
ture, there  is  a special  layer  that  buffers  the  input  signals.  It  is  labeled  as  the  input  layer  of 
the  network.  When  we  count  the  number  of  layers  in  a network,  the  input  layer  is  usually 
not  included.  The  layers  between  the  input  and  the  output  of  the  network  are  named  hid- 
den layers  because  they  are  invisible  from  the  outside  world.  If  a network  possesses  more 
than  one  layer  of  processing  elements,  we  call  it  Multi-Layer  Perceptron  (MLP).  A sche- 
matic diagram  of  a two-layer  MLP  is  shown  in  Figure  3-1. 
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Figure  3-1.  Schematic  diagram  of  a two-layer  MLR 


In  recent  years,  several  different  research  groups  have  studied  the  function  approx- 
imation capability  of  neural  networks  [Hor89][Gal88][Hec89][Iri88][Gir89].  Gallant  and 
White  showed  that  a single  hidden  layer  network  using  the  cosine  squasher  activation 
function  is  a special  case  of  Fourier  network  which  yields  a Fourier  series  approximation 
to  a given  function  as  its  output[Gal88],  Thus,  such  networks  possess  all  the  approxima- 
tion properties  of  Fourier  series  representation.  Hornik  et  al.  made  use  of  the  Stone-Weier- 
strass  Theorem  to  establish  that  standard  MLR  architectures  using  arbitrary  squashing 
functions  can  approximate  virtually  any  function  of  interest  to  any  desired  degree  of  accu- 
racy, provided  sufficiently  many  hidden  units  are  available  [Hor89].  Hecht-Nielsen  sug- 
gested that  Kolmogorov's  superposition  theorem  or  its  most  recent  improvement  [Lor76] 
could  be  used  to  justify  that  any  L2  (Lebesgue  square  integrable)  function  can  be  imple- 
mented to  any  desired  degree  of  accuracy  with  a two-layer  MLR  [Hec89].  Although, 
Girosi  and  Roggio  questioned  the  legitimacy  of  the  association  of  Komolgorov's  theorem 
with  the  function  approximation  capability  of  neural  networks[Gir89],  many  experimental 
results  in  recent  years  have  shown  that  artificial  neural  networks  really  outperform  con- 
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ventional  nonlinear  models  in  many  function  approximation  applications  [Lap87][Jon90- 
b][Mead91][Wei90]. 

An  artificial  neural  network  can  be  regarded  as  a dynamical  system  [Pin88],  The 
relaxation  of  neural  networks  can  exhibit  a rich  variety  of  dynamical  behaviors  [Cha92] 
[van90].  This  property  is  highly  desirable  in  dynamic  modeling,  because  in  eq.  3.1-1  we 

not  only  require  a model  to  approximate  the  mapping  F^,  but  also  expect  that  the  iterates 
of  the  model  can  preserve  the  dynamical  properties  of  the  original  system.  The  possible 
dynamics  of  a selected  model  have  to  be  rich  such  that  the  model  can  be  tailored  to 
approximate  any  desired  dynamics.  On  the  other  hand,  when  we  iterate  a polynomial 
model  its  outputs  usually  blow  up  quickly  [Lap87].  Therefore,  the  polynomial  approxima- 
tion is  generally  used  only  to  model  local  behavior  of  a dynamical  system  or  low-dimen- 
sional attractors  [Far87][Eis91]. 

Another  advantage  of  using  ANNs  is  due  to  their  ability  to  generalize  what  they 
learn  during  training  to  new  situations.  If  the  signal  to  be  modeled  is  noisy  and  has  finite 
length,  it  is  desirable  that  a model  is  able  to  interpolate  and  extrapolate  the  mapping  from 
the  training  examples  in  a sensible  way. 

To  summarize,  due  to  their  plasticity,  function  approximation  capability,  wide 
spectrum  of  possible  dynamics,  and  generalization  capability,  artificial  neural  networks 
become  an  attractive  nonlinear  modeling  tool  for  dynamic  modeling.  Therefore,  in  this 
research  ANNs  are  mainly  used  as  the  modeling  tool. 

In  this  chapter,  I will  examine  ANNs  according  to  the  needs  of  dynamic  modeling. 
An  artificial  neural  network  can  be  specified  by  the  activation  functions  of  its  PEs,  its  con- 
nection topology,  and  the  learning  algorithm  for  the  adaptation  of  its  parameters.  In  the 
following  sections,  I start  with  the  discussion  of  the  functions  of  a single  PE.  Then,  the 
discussion  will  be  extended  to  the  network  dynamics  in  different  connection  structures. 
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For  dynamic  modeling,  a network  topology  is  proposed.  The  memory  mechanism  of  neu- 
ral networks  will  be  studied.  At  the  end  of  this  chapter,  I compare  two  trajectory  learning 
algorithms,  and  propose  a modified  learning  algorithm.  The  experimental  results  of  apply- 
ing ANNs  to  dynamic  modeling  will  be  given  in  Chapter  4. 

3.2  Functions  of  a Single  Processing  Element 
In  a biological  neural  network,  the  transmission  of  a signal  from  one  cell  to  another 
at  a synapse  is  a complex  chemical  process  in  which  specific  transmitter  substances  are 
released  from  the  sending  side  of  the  synaptic  junction.  The  effect  is  to  raise  or  lower  the 
electrical  potential  inside  the  body  of  the  receiving  cell.  If  a cell  collects  enough  these  sub- 
stances from  neighboring  cells  such  that  its  electrical  potential  reaches  a threshold,  a train 
of  pulses  or  action  potentials  will  be  sent  down  to  its  axon  which  may  make  a few  thou- 
sand synapses  with  other  neurons.  We  then  say  the  cell  has  fired. 

Each  processing  element  (PE)  in  the  parallel  structure  of  an  ANN  implements  a 
simple  sum-and-threshold  activation  function,  which  mimics  the  firing  process  of  biologi- 
cal neurons.  A PE  computes  the  weighted  sum  of  its  input  signals  from  other  PEs  and  then 
applies  a threshold  function  to  the  result.  Therefore,  the  activation  function  of  a PE  can  be 
expressed  by 

x.(t)  ^ G (mt.it))  ^and  eq.  3.2-1 

net.(t)  = J^^'ijXj(t) 

j 

where  x.  (t)  is  the  output  of  i-th  PE,  x.  (t)  can  be  the  output  of  the  j-th  PE  or  an  external 
input,  a ( ) is  a threshold  function,  and  w is  the  strength  of  the  connection  from  j-th  PE 
to  i-th  PE.  To  be  able  to  apply  gradient  based  optimization  procedures  to  train  neural  net- 
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works,  the  threshold  function  is  usually  approximated  by  a differentiable  sigmoidal  func- 
tion. A functional  diagram  of  a PE  is  shown  in  Figure  3-2. 


Figure  3-2.  Functional  diagram  of  a PE 


Since  all  of  the  connection  weights  are  adjustable,  the  LHS  (left-hand  side)  of  the 
PE  in  Figure  3-2  acts  as  an  adaptive  linear  combiner  [Wid85].  In  adaptive  signal  process- 
ing, this  system  is  applied  to  obtain  the  best  representation  of  a desired  signal  in  the  input 
signal  space,  i.e.  span(  {x.(t) } ) [Hon84].  And,  the  RHS  of  the  PE  is  a sigmoidal  func- 
tion, i.e.  a nonlinear  mapper.  In  this  formulation,  a PE  can  be  viewed  an  adaptive  nonlinear 
combiner  of  its  input  signals. 

To  understand  the  role  of  the  activation  of  a PE  in  a function  approximation  prob- 
lem, let  us  place  this  PE  into  the  hidden  layer  of  a two-layer  MLP  as  shown  in  Figure  3-1 . 
And,  assume  that  the  output  PE  is  a linear  unit.  Then,  the  output  of  the  network,  (t) , 

can  be  computed  by 

eq.  3.2-2 

i i 


Now,  net.(t)  becomes  the  weighed  sum  of  external  inputs  to  the  i-th  hidden  PE.  In  eq. 
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3.2-2,  the  activation  functions  of  all  hidden  PEs  can  be  regarded  as  the  bases  of  a linear 
space,  and  the  output  of  the  network  is  just  a linear  combination  of  these  bases.  For  a 
given  desired  output  signal,  the  training  of  the  network  is  to  find  the  best  representation, 
i.e.  the  orthogonal  projection,  of  the  signal  in  this  linear  space.  A geometrical  interpreta- 
tion is  shown  in  Figure  3-3,  provided  there  are  only  two  hidden  PEs.  Polynomial  approxi- 
mation can  also  be  formulated  in  this  way  when  we  replace  each  activation  function  with  a 
monomial  term.  But,  as  we  mentioned  above  the  basis  functions  provided  by  a neural  net- 
work is  an  adaptive  nonlinear  combiner  of  the  external  input  signals.  In  other  words,  the 
required  basis  functions  to  best  represent  a desired  output  signal  can  be  obtained  through 
training. 


Figure  3-3.  Geometrical  Interpretation  of  eq.  2.3-2. 


In  eq.  3.2-1,  we  assume  that  the  threshold  of  the  activation  function  is  zero.  Gener- 
ally, the  threshold  can  be  any  value.  However,  if  we  assume  that  there  is  a PE  whose  out- 
put is  always  1,  the  threshold  can  be  regarded  as  the  strength  of  the  connection  from  this 
imaginary  PE  to  the  PE  of  interest.  Therefore,  eq.  3.2-1  is  still  valid  when  the  threshold  is 


taken  into  account. 
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Another  assumption  in  eq.  3.2-1  is  that  all  of  the  PEs  in  a network  will  be  activated 
synchronously.  There  are  some  works  reported  in  the  literature  in  which  PEs  are  activated 
according  to  some  prescribed  probabilities  [Hop82a].  However,  there  is  no  theoretical 
support,  except  biological  plausibility,  to  show  that  this  activation  asychronization  can 
yield  a better  performance.  And,  to  synchronize  the  activations,  we  are  able  to  simplify  the 
simulation  and  the  analysis  of  network  behaviors. 

Due  to  the  simple  form  of  their  derivatives,  the  logistic  function, 

/p  = 1/(1+  e~^^) , and  the  hyperbolic  function,  tanh(Px),  are  frequently  chosen  as 

the  sigmoidal  activation  function  of  PEs.  Here,  P denotes  the  gain  of  the  sigmoidal  func- 
tion. The  derivative  of  the  logistic  function,  /g  (x) , is  P/p  (x)  ( 1 -/p  (x)  ) , and  that  of  the 

hyperbolic  function,  tanh(Px),  is  P(1  -tanh^Px).  These  two  functions  can  be  related 
with  each  other  by 

tanh(Px)  = 2/2p(x)  - 1 eq.  3.2-2 


Therefore,  the  selection  of  either  one  of  these  sigmoidal  functions  should  not  affect  the 
performance  of  a network. 

Besides  sigmoidal  functions,  another  popular  choice  of  the  activation  function  is 
the  radial  basis  function,  which  has  the  same  form  as  a gaussian  function  and  can  be 
expressed  by 


g(x)  = e 


eq.  3.2. -3 


2 

where  x is  the  input  vector,  and  x and  a are  the  center  and  the  variance  of  the  gaussian 


function  respectively,  and  ||  ||  is  a norm  to  measure  distance.  In  this  formulation,  the 


weighted  inputs  to  a PE  at  each  time  instance  are  presented  in  a vector  form  to  compute 
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the  activation.  Each  individual  PE  with  this  activation  function  provides  a local  receptive 
field.  For  a given  input  vector,  there  are  only  a small  fraction  of  network  PEs  with  centers 
close  to  this  input.  They  are  the  only  ones  that  will  respond  with  activations  which  differ 
significantly  from  zero.  Thus,  only  the  connection  weights  of  these  activated  PEs  need  to 
be  trained  [Moo89].  This  may  result  in  a fast  convergence  in  training.  However,  the  per- 
formance of  the  network  with  this  activation  function  relies  heavily  on  the  locations,  the 
variances  and  the  number  of  the  radial  basis  functions.  Usually,  we  determine  these  design 
parameters  according  to  the  distribution  of  the  input  vectors,  estimated  by  using  a dusting 
algorithm  (e.g.  K-means  algorithm  [Dud73]). 

For  function  approximation  problems,  there  is  no  theoretical  basis  to  prefer  a sig- 
moidal function  to  a radial  basis  function.  However,  if  a neural  network  includes  some 
feedback  connections,  to  determine  the  design  parameters,  i.e.  the  centers  and  the  vari- 
ances, of  the  radial  basis  functions  will  become  very  complicated.  And,  in  dynamic  mod- 
eling, the  training  of  the  iterative  map  of  a model  relies  on  these  feedback  connections. 
Therefore,  in  this  work,  I always  use  the  sigmoidal  function  as  the  activation  function  of 
PEs. 

3.3  Connection  Topology  of  ANNs 


An  artificial  neural  network  is  a distributed  system.  Each  individual  processing 
element,  which  implements  a simple  nonlinear  function,  is  connected  to  other  processing 
elements  in  the  network.  In  the  discrete-time  case,  the  collective  behavior  of  a neural  net- 


work can  be  expressed  in  a matrix  form, 
X{t)  = g(WXO-  1)) 


eq.  3.3-1 


X(t)  = 


Xj  (0 

x^(t) 


W = [w..] 

L /jJ 


• • • 
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where  X(t)  is  the  activation  vector  of  the  network  at  time  step  t,  W is  the  weight  matrix, 

and  o ( ) is  the  vector  of  activation  functions.  We  note  that  eq.  3.3-1  represents  a dynam- 
ical system  whose  system  states  are  the  activations  of  the  PEs  in  the  network.  With  differ- 
ent weight  matrices,  a network  can  exhibit  a wide  variety  of  dynamical  behaviors. 

The  weight  matrix  also  represents  the  connection  topology  of  a neural  network. 
For  an  MLP,  if  we  number  its  PEs  according  to  the  forward  sequential  order  of  the  layer 
where  they  are  located,  the  weight  matrix  will  become  a lower  triangular  matrix.  Since 
there  is  no  feedback  connection,  this  structure  provides  no  mechanism  to  retain  the  infor- 
mation from  the  past.  This  means  that  the  mapping  developed  by  an  MLP  is  memoryless. 
The  output  of  the  network  is  a function  of  the  current  input  pattern  only.  Networks  of  this 
property  are  called  static  networks  [Hus93].  On  the  other  hand,  for  those  neural  networks 
with  feedback  connections  their  weight  matrices  can  not  be  arranged  into  a lower-triangu- 
lar form.  And,  the  current  activations  of  these  networks  will  be  affected  by  the  previous 
activations  via  feedback  connections.  This  property  brings  in  the  concept  of  dynamics. 
Therefore,  networks  with  any  feedback  connections  are  called  dynamic  networks  or  recur- 
rent networks  [Hus93]. 

In  dynamic  modeling,  we  try  to  approximate  the  mapping  F'^,  which  maps  the  data 

samples  in  the  past  to  the  current  data  sample  for  a given  time  series.  In  other  words,  the 
model  is  required  to  be  able  to  predict  the  current  data  sample  based  on  a number  of  previ- 
ous data  samples.  Therefore,  a short-term  memory  structure  must  be  embedded  in  the  net- 
work topology  to  retain  the  previous  data  samples  for  prediction.  For  an  MLP,  this  can  be 
implemented  by  replacing  its  input  layer  with  a delay  line.  The  resulting  structure  is  called 
Time-Delay  Neural  Network  (TDNN)  [Koh89].  The  schematic  diagram  of  a TDNN  is 
given  in  Figure  3-4.  Equipped  with  a delay  line  in  its  input  layer,  a TDNN  is  able  to  con- 
vert a temporal  sequence  into  a network  input  vector.  Basically,  a TDNN  is  still  an  MLP. 
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The  function  approximation  capability  of  the  MLP  has  been  extensively  studied  [Hor89] 
[Gal88][Hec89][Iri88]  and  has  been  shown  superior  to  that  of  the  polynomial  approxima- 
tion [Lap87],  Therefore,  a TDNN  can  be  a powerful  nonlinear  mapper.  In  Figure  3-4,  I 
also  show  an  adaptive  scheme  to  train  a TDNN  as  a one-step  signal  predictor. 

However,  as  we  discussed  in  section  2.4.2  during  the  training  of  a dynamic  model 
we  should  impose  some  constraints  on  the  iterative  map  of  the  model  such  that  the  model 
can  learn  the  underlying  dynamic  better.  To  achieve  this,  I propose  to  add  an  output  feed- 
back connection  to  the  input  delay  line  of  a TDNN,  and  train  a network  of  this  structure  as 
a dynamic  model.  This  structure  is  shown  in  Figure  3-5,  and  I will  call  it  TDNNGF 
(TDNN  with  global  feedback).  We  note  that  in  the  feedback  connection  of  this  structure 
the  network  output  is  just  delayed  one  time  unit  before  being  clocked  into  the  input  delay 
line.  An  adaptive  scheme  to  train  a TDNNGF  as  a dynamic  model  is  also  shown  in  Figure 
3-5.  Therefore,  the  forward  structure  of  a TDNNGF  is  a powerful  adaptive  mapper,  and  its 
feedback  connection  allows  us  to  refine  its  iterative  map  during  training. 
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Figure  3-4.  TDNN  in  an  adaptive  scheme  for  the  training  of 
a one-step  predictor. 
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Figure  3-5.  TDNNGF  in  an  adaptive  scheme  for  the  training  of 
a dynamic  model. 


In  fact,  any  dynamic  networks  with  a global  feedback  connection  can  be  used  as  a 
modeling  tool  in  this  work.  A temporal  sequence  required  for  prediction  is  retained 
through  the  feedback  structures  of  these  networks.  Therefore,  the  training  of  these  net- 
works not  only  needs  to  develop  a required  mapping,  but  also  needs  to  create  an  implicit 
memory  structure  to  store  previous  data  samples.  Consequently,  the  roles  of  the  PEs 
become  very  difficult  to  identify.  A most  generalized  dynamic  network  has  a fully-con- 
nected recurrent  structure.  At  the  end  of  this  Chapter,  I will  apply  this  structure  to  model  a 
limit  cycle  using  a proposed  trajectory  learning  algorithm. 

For  a TDNN,  its  input  layer  is  just  a delay  line.  The  order  of  the  delay  line  repre- 
sents the  number  of  the  data  samples  that  we  can  retain  in  the  network.  To  decouple  the 
order  of  a memory  and  its  memory  depth,  a memory  structure  with  global  feedforward  and 
local  recurrent  connections  has  been  proposed  [deV92].  This  memory  structure  is  referred 
to  as  gamma  memory.  In  the  next  subsection,  we  will  discuss  the  gamma  memory.  The 
application  of  this  memory  structure  in  this  work  can  be  found  in  section  3.4.4  and  4.4. 
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In  section  3.3.2, 1 propose  a method  to  estimate  the  output  of  a single  PE  with  self- 
feedback connection  graphically.  And,  a phase  portrait  of  the  PE  dynamics  can  also  be 
obtained  in  this  method. 


3 .3.1  Linear  Memory  Structures 

To  memorize  the  activations  or  the  input  data  samples,  a straightforward  solution 
is  to  attach  delay  lines  to  the  processing  elements  of  a neural  network.  The  signals  at  the 
taps  of  the  delay  lines  can  be  fed  to  another  processing  elements.  To  simplify  the  discus- 
sion, let  us  assume  that  i-th  PE  receives  signals  only  from  the  outputs  of  a delay  line.  The 
weighted  sum,  net.  (/) , computed  by  i-th  PE  can  be  expressed  by 


K 

net.(t)  = '^w.xit-j) 

7 = 0 


eq.  3.3. 1-1 


where  K is  the  order  of  the  delay  line,  and  x (t)  is  the  signal  clocked  into  the  delay  line. 
We  note  that  eq.  3.3. 1-1  represents  an  FIR  filter.  In  this  implementation,  the  number  of  the 
data  samples  that  can  be  stored  in  a delay  line  is  limited  by  the  order  of  the  delay  line.  A 
solution  to  decouple  the  order  of  a linear  memory  and  the  memory  depth  that  it  can  pro- 
vide is  to  introduce  some  feedback  connections  in  the  memory  structure.  From  the  point  of 
view  of  digital  filtering.  This  is  a change  from  an  FIR  (finite  impulse  response)  filter  to  an 
HR  (infinite  impulse  response)  filter. 

In  the  study  of  short-term  memory,  a special  IIR-type  memory  structure  was  pro- 
posed by  de  Vries  and  Principe  [deV92][Pri93a].  This  memory  structure  was  named 
gamma  memory.  A schematic  diagram  of  a gamma  memory  is  shown  in  Figure  3-6.  We 
note  that  this  memory  has  both  global  feedforward  and  local  feedback  connections.  If  the 
parameter  p is  set  to  I,  the  structure  becomes  a delay  line.  Thus,  the  delay  line  can  be 
regarded  as  a special  case  of  this  formalism.  The  transfer  function  at  k-th  tap  of  a gamma 
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memory  is  given  by 


GAz)  = 


^z-  (1  -n) 


eq.  3.3. 1-2 


where  is  the  parameter  of  the  gamma  memory.  This  memory  filter  has  a single  pole  at 
(l-|o.)  with  the  multiplicity  of  k.  Since  we  can  control  the  stability  of  this  memory  filter  by 
limiting  the  value  of  p in  the  range  of  (0,2),  the  parameter  p can  also  be  adjusted  with  con- 
nection weights  during  training.  Later,  when  we  define  the  memory  depth  for  an  IIR-type 
memory  structures,  we  will  find  that  this  is  an  important  feature  of  the  gamma  memory. 


Figure  3-6.  Discrete-time  gamma  memory. 


Figure  3-7.  Continuous-time  gamma  memory. 

The  gamma  memory  was  first  derived  in  the  continuous-time  domain.  The  contin- 
uous-time version  of  this  memory  is  shown  in  Figure  3-7  [deV92].  The  transfer  function  at 
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each  stage  is 


H(s) 


5 + |1 


eq.  3.3. 1-3 


We  notice  that  this  structure  is  actually  a Poisson  Filter  Chain  with  a constant  gain.  The 
impulse  response  at  k-th  tap  of  this  continuous-time  memory  is  given  by 


gk(0  = 


It 


(^-1)! 


.k  - 1 -11/ 
t e 


eq.  3.3. 1-4 


Since  this  expression  is  just  the  integrand  of  the  normalized  F-function,  this  memory  filter 
is  referred  to  as  gamma  memory.  The  impulse  response  at  each  tap  of  a gamma  memory  is 
called  z.  gamma  kernel.  A set  of  gamma  kernels  have  the  following  property 

— = -pg,  eq.  3.3. 1-5 

^ = - Itg^  + l-ig*  - 1 k = 2,3,...,K 

where  K is  the  order  of  the  gamma  memory.  For  a given  input  signal  x(t),  the  output  of  a 

gamma  memory  at  k-th  tap  is  called  a gamma  state,  (0  • It  is  computed  by 

/ 

Xk{t)  = \x{x)g^{t-x)dx  eq.3.3.1-6 

0 

where  Xq  (/)  = x{t) . Taking  the  time  derivative  of  eq.  3.3. 1-6  and  applying  Leibniz'  rule, 
we  have 

/ 

= Jx(T)^^^(/-x)//x+g^(0)x(0 
0 

= -px^(0  (0 


eq.  3.3. 1-7 


93 


with  initial  conditions 


g^(0)  = 0 k>2 

gi(0)  = [i 


eq.  3.3. 1-8 


The  second  equality  in  eq.  3.3. 1-8  can  be  obtained  from  eq.  3.3. 1-4.  For  the  discrete-time 
case,  the  first  order  forward  difference  is  used  to  approximate  the  derivative,  and  the  eq. 
3.3. 1-7  becomes 


This  formulation  suggests  the  structure  given  in  Figure  3-6.  Since  later  on  we  only  discuss 
the  discrete-time  gamma  memory,  I will  use  the  same  notation,  (/) , to  represent  the 

gamma  kernel  at  k-th  tap  of  the  discrete-time  gamma  memory. 

When  we  use  a delay  line  as  a memory  device,  the  memory  depth  and  the  memory 
resolution  can  be  defined  in  a straightforward  way.  The  former  is  the  number  of  samples 
stored  in  the  delay  line,  and  the  latter  is  the  number  of  samples  stored  in  each  delay  unit, 
which  is  1 for  a delay  line.  On  the  other  hand,  the  definitions  for  these  two  quantitative 
measurements  for  the  gamma  memory  is  not  obvious.  The  gamma  kernels  for  p.=0.7  are 
shown  in  Figure  3-8.  We  note  that  the  kernels  have  dispersive  waveforms.  Since  the  struc- 
ture of  the  gamma  memory  has  an  infinite  impulse  response,  the  following  definitions  to 
characterize  the  memory  were  proposed  [deV92].  First,  the  mean  sampling  time  for  the 

k-th  tap  is  defined  as 


= (1 1) +px^_i(/- 1) 


eq.  3.3. 1-9 


00 


eq.  3.3.1-10 
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The  mean  sampling  period  at  tap  k can  be  estimated  as 


= ^k-^k-x 


I 


eq.  3.3.1-11 


We  note  that  this  expression  of  the  mean  sampling  period  is  independent  of  k.  The  mean 
memory  depth  for  a gamma  memory  of  order  K is  then  defined  as 


K 


D = 


I = 1 


K 

11 


eq.  3.3.1-12 


The  resolution  is  defined  as  the  reciprocal  of  the  mean  sampling  period,  which  is  com- 
puted by 


eq.  3.3.1-13 


As  an  example,  if  we  have  a 5th  (K=5)  order  gamma  memory  with  p— 0.5,  the 
memory  depth  is  K/p  = 5/0.5  = 10,  and  the  resolution  is  equal  to  0.5.  According  to  eq. 
3.3.1-12,  the  memory  depth  of  a K-th  order  gamma  memory  can  be  adjusted  by  changing 
p.  In  other  words,  the  order  of  the  memory  and  the  memory  depth  can  be  decoupled.  How- 
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ever,  from  eq.  3.3.1-13  we  note  that  this  decoupling  is  achieved  at  the  expense  of  data  res- 
olution. 

As  we  mentioned  before,  the  parameter  p can  be  adapted  during  training.  A train- 
ing algorithm  for  this  parameter  based  on  gradient  descend  method  was  proposed  by  de 
Vries  et.  al.  [deV92].  According  to  eq.  3.3. 1-12,  this  adaptation  can  be  regarded  as  an  opti- 
mization of  the  required  memory  depth  and  data  resolution.  We  note  that  a general  IIR- 
type  memory  can  not  provide  this  adaptive  memory  depth  because  adjusting  feedback 
coefficients  in  an  IIR  filter  may  cause  instability.  From  the  point  of  view  of  signal  repre- 
sentations, this  adaptability  also  has  its  advantage.  This  advantage  can  be  understood  from 
a geometrical  interpretation  of  eq.  3.3. 1-1.  The  signal  net.{t)  is  a linear  combination  of 

the  signals  at  the  taps  of  a delay  line.  Therefore,  we  can  consider  the  signals  at  the  taps  as 
the  basis  functions  of  a signal  space.  This  space  was  called  memory  space  [Pri93e].  Since 
these  basis  functions  are  just  the  input  signal  and  its  delayed  versions,  the  memory  space 
of  a K-th  order  delay  line  is  completely  determined  for  a given  input  signal.  Assume  that 
there  is  a desired  signal  for  net.  (t)  (Later,  when  we  discuss  the  backpropagation  algo- 


rithm we  find  that  this  desired  signal  can  really  be  obtained  during  training.).  Then,  the 
adaptation  of  the  connection  weights  w..  is  just  to  find  the  orthogonal  projection  of  this 


desired  signal  on  the  memory  space.  The  geometrical  interpretation  is  shown  in  Figure  3- 
9(a).  On  the  other  hand,  if  we  replace  the  delay  line  with  a gamma  memory,  the  signal  at 
the  k-th  tap  is  the  convolution  of  k-th  gamma  kernel  and  the  input  signal.  And,  eq.  3.3. 1-1 
needs  to  be  modified  as 


K 

7 = 0 


eq.  3.3.1-14 


where  * is  the  convolution  operator.  Since  gamma  kernels  are  functions  of  p,  the  basis 
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functions  of  the  memory  space  now  depends  on  the  input  signal  as  well  as  the  parameter 
p.  For  a given  input  signal,  to  reduce  the  error  between  a desired  signal  and  its  orthogonal 
projection  on  the  memory  space  we  need  to  increase  the  order  of  a delay  line.  But,  if  we 
use  a gamma  memory  this  error  reduction  can  be  achieved  by  adjusting  the  parameter  p 
without  increasing  the  order  of  the  memory.  In  other  words,  we  can  rotate  the  memory 
space  towards  the  desired  signal.  The  geometrical  interpretation  for  this  adjustment  is 
shown  in  Figure  3-9(b). 


Figure  3-9.  Geometrical  interpretation  of  memory  space 
(a)  for  a delay  line,  (b)  for  a gamma  memory. 

We  note  that  the  above  discussion  is  based  on  the  assumption  that  the  PE  receives 
signals  only  from  the  outputs  of  a single  memory.  When  the  PE  receives  signals  from  the 
outputs  of  several  memories,  this  approach  becomes  impractical. 

In  section  3.4.4,  we  will  use  the  gamma  memory  in  a modified  trajectory  learning 
algorithm.  And,  in  section  4.4  I proposed  a multi-channel  MLP,  whose  input  units  are 
equipped  with  gamma  memories,  as  a state  predictive  model  to  reduce  noise.  In  both 
applications,  the  feature  of  adjustable  memory  depth  provided  by  the  gamma  memory 
plays  an  important  role. 
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3 3.2  Nonlinear  Memory  Structures 

In  a recurrent  network,  there  is  no  explicit  memory  structure.  The  data  samples  in 
the  past  are  contained  in  the  activations  of  network  PEs.  Due  to  the  nonlinearity  and  mas- 
sive connections,  it  is  very  difficult  to  analyze  how  a network  stores  a specific  data  sam- 
ple. Therefore,  in  this  section  I propose  a method  to  study  only  how  a single  PE  with  self- 
recurrency retains  data  samples. 

For  an  IIR-type  linear  memory,  the  signal  at  a memory  tap  is  the  convolution  of  the 
input  signal  with  the  impulse  response  at  that  tap.  We  can  compute  the  convolution 
directly  to  obtain  the  signal,  or  we  can  construct  the  output  signal  graphically  by  using  the 
folding  and  sliding  technique  to  represent  convolution  operation  [McG84].  For  a nonlinear 
PE,  the  impulse  response  becomes  meaningless.  The  output  of  the  PE  can  be  computed 
according  to  its  activation  function.  But,  this  computation  lacks  a vivid  view  about  how 
data  samples  are  processed  in  a single  PE.  Therefore,  I propose  a method  to  construct  the 
output  signal  of  a single  PE  in  a phase  plane. 

Before  the  method  is  introduced,  let  us  reformulate  the  activation  of  a PE.  For  a 
discrete-time  case,  the  activation  of  i-th  PE  can  be  computed  by 
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eq.  3. 3. 2-1 


where  f(0  - 

j*i 
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For  this  moment,  let  us  assume  I(t)  equals  to  0.  Then,  eq.  3. 3. 2-1  becomes 

x{t)  - c{wx(t-  1)) 


eq.  3. 3. 2-2 


To  simplify  the  notation,  I drop  subscripts.  We  note  that  eq.  3. 3. 2-2  is  an  iterative  map.  For 
a given  initial  condition,  we  can  compute  a sequence.  We  can  also  construct  a phase  por- 
trait for  this  iterative  map  [Dev89].  Assume  w is  positive,  and  the  construction  procedure 

is  as  follows: 

1)  Plot  the  curve  of  the  activation  function  (a  sigmoid)  in  phase  plane,  x(t)-x(t-l) 
plane. 

2)  Plot  the  diagonal  line  x(t)  = x(t-l). 

3)  Draw  a vertical  line  across  x(0).  The  ordinate  of  the  intersection  point  between  this 
line  and  the  sigmoid  is  x(l). 

4)  Draw  a horizontal  line  across  x(l).  The  intersection  point  between  this  line  and  the 
diagonal  line  will  be  used  as  the  point  to  evaluate  the  next  iterate. 

5)  Draw  a vertical  line  across  the  point  obtained  in  the  previous  step  to  the  sigmoid. 
The  ordinate  of  the  intersection  point  is  x(2). 

6)  Replace  x(l)  with  x(i),  and  Repeat  step  4 and  5 to  find  the  next  iterates  x(i+l). 


In  Figure  3-10,  I show  two  examples  of  this  construction.  In  Figure  3-10(a),  the 
maximal  slope  of  the  sigmoid  is  less  than  1.  The  diagonal  line  and  the  sigmoid  only  inter- 
sect at  one  point  (the  origin).  Since  the  iterates  will  approach  this  point,  the  point  is  an 
attractor.  On  the  other  hand,  when  the  maximal  slope  of  the  sigmoid  is  larger  than  1,  there 
are  three  intersection  points  (Figure  3-10(b)).  The  iterates  will  leave  the  one  in  the  middle 
and  approach  the  other  two  points.  Therefore,  the  middle  is  a repellor,  and  the  other  two 
are  attractors.  If  w is  negative,  we  just  need  to  flip  the  sigmoid  about  the  vertical  axis.  The 
construction  procedure  is  the  same.  But,  in  this  case  we  can  observe  oscillations.  And, 
since  there  is  only  one  intersection  point,  it  is  an  attractor. 
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Now,  let  us  take  into  account  the  term  I(t)  in  eq.  3. 3. 2-1.  This  term  represents  an 
external  input  signal  to  the  PE.  If  we  consider  I(t)  as  a time-varying  bias  of  the  activation 
function,  we  can  still  use  the  procedure  mentioned  above  to  compute  the  output  of  the  PE. 
But,  the  sigmoid  has  to  be  shifted  horizontally  at  each  iteration  by  an  amount  that  depends 
on  the  value  of  o (/  (/) ) . An  example  to  construct  phase  portrait  for  a given  I(t)  is  illus- 
trated in  Figure  3-11.  We  first  estimate  the  bias  due  to  input  signal  I(t).  The  corresponding 
sigmoid  curve  can  be  determined.  Then,  we  apply  the  procedure  introduced  above  to  esti- 
mate the  next  sample.  We  can  repeat  these  steps  to  estimate  all  iterates.  This  procedure 
will  also  yield  a phase  portrait  (a  trajectory  of  (x(t),x(t-l)) ) of  the  PE  dynamics. 

As  the  folding  and  sliding  technique  allows  us  to  estimate  graphically  the  output  of 
a linear  memory  in  the  time  domain,  this  method  enables  us  to  estimate  the  output  of  a sin- 
gle PE  in  a phase  plane.  The  resulting  phase  portrait  in  this  method  also  provides  a picture 
of  the  PE  dynamics.  However,  it  is  very  difficult  to  extend  this  method  to  deal  with  a 
group  of  PEs. 
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Figure  3-10.  Computing  iterates  in  phase  plan,  (a)  slope  of  the  sigmoid 
less  than  1,  (b)  slope  of  the  sigmoid  greater  than  1. 
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Figure  3-11.  Computing  iterates  in  phase  plane  (with  input  I(t)). 
(a)  evaluating  o(I(t)),  (b)  constructing  phase  portrait,  and 
(c)  ploting  output  sequence. 


3.  4 Learning  Algorithms  for  ANN^ 

The  learning  algorithms  are  the  update  formulae  for  the  network  parameters  during 
training.  These  algorithms  can  be  divided  into  two  big  categories:  supervised  teaming 
algorithms  and  unsupervised  learning  algorithms.  In  supervised  learning,  a network  is  pre- 
sented with  pairs  of  input  and  target  output  patterns.  The  objective  of  the  training  is  to 
minimize  the  error  between  the  target  output  and  the  actually  network  output.  On  the  con- 
trary, in  an  unsupervised  learning,  training  patterns  do  not  include  desired  outputs.  These 
learning  algoritms  enable  a network  to  extract  features  from  input  patterns.  Therefore,  the 
network  will  display  some  degrees  of  self-organizing  capability  [Koh89]. 
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According  to  the  length  of  the  desired  output  patterns,  The  supervised  learning 
algorithms  can  be  further  divided  into  two  subclasses:  fixed-point  learning  algorithms  and 
trajectory  learning  algorithms.  In  a fixed-point  learning,  the  target  output  pattern  is  a 
point.  We  usually  do  not  sepecify  the  direction  or  the  length  of  the  path  along  which  the 
outputs  of  the  network  approach  the  point.  On  the  other  hand,  in  a trajectory  learning,  the 
desired  output  pattern  is  a trajectory  (or  a sequence  for  a single  network  output).  The  out- 
puts of  a network  are  required  to  follow  the  trajectory. 

In  dynamic  modeling,  the  optimization  criterion  (or  the  cost  function)  is  given  in 
eq.  2. 4. 2-8.  For  the  convenience  of  our  discussions,  I simplify  the  notation  of  the  equation 
and  rewrite  it  as 

r r 

£■  = V ’^dist(x'.-Jj)  eq.  3.4-1 

y=  ii  = 1 

where  r is  the  number  of  training  patterns,  /'  is  the  length  of  each  training  pattern,  x!.  is  the 

'^j 

i-th  data  sample  in  the  j-th  training  sequence,  and  x,  is  the  output  of  the  network  corre- 
sponding to  In  this  formulation,  if  we  choose  /'  larger  than  1,  target  patterns  are 

sequences.  We  need  to  employ  a trajectory  learning  algorithm  to  minimize  E.  If  /'  is  equal 
to  I,  eq.  3.4-1  becomes 

r 

E - ^ dist  (x^  - x)  eq.  3 .4-2 

y = 1 

Each  training  pattern  is  just  a point.  A fixed-point  learning  algorithm  can  be  adopted  in 
network  training.  We  note  that  with  this  choice  the  network  is  trained  as  a one-step  predic- 
tor. 

The  selection  of  the  learning  algorithm  may  also  decide  the  type  of  a required  net- 
work. In  a fixed-point  learning,  we  can  use  either  a static  network  or  a dynamic  network. 
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In  eq.  3.4-2,  the  output  of  the  network  is  assumed  to  converge  to  a fixed-point  in  one  step. 
Therefore,  a static  network  can  be  used.  However,  in  a trajectory  learning  a dynamic  net- 
work is  definitely  required. 

A learning  algorithm  is  derived  from  an  optimization  procedure.  There  are  several 
optimization  procedures  available,  e.g.  random  search.  Genetic  Algorithm  (GA)  [Gol89], 
simulated  annealing  [Kir83],  and  gradient  descent.  These  optimization  methods  have  their 
own  merits  for  some  specific  problems.  But  in  adjusting  massive  parameters,  the  first 
three  procedures  (random  search,  GA,  and  simulated  annealing)  are  not  as  effecient  as  the 
gradient  descend  method.  Therefore,  the  gradient  descent  method  or  its  variations  are 
widely  accepted  in  network  training.  In  this  work,  all  of  the  learning  algorithms  are 
derived  based  on  the  gradient  descent  method.  To  be  able  ot  use  the  gradient  information 
in  the  parameter  update,  the  cost  function  must  be  differentiable.  Therefore,  the  1-2  norm  is 
usually  selected  as  the  distance  metric  in  eq.  3.4-1.  With  this  selection,  the  cost  function 
becomes 


r 


;=  i/=  1 


eq.  4.3-3 


where  1/2  is  just  a factor  that  will  eliminate  a constant  in  the  later  derivations  of  learning 
algorithms.  In  dynamic  modeling,  the  cost  function  given  above  will  be  used. 

In  a gradient  descent  procedure,  we  adjust  all  connection  weights  of  a network  in 
the  opposite  direction  of  the  gradient  of  the  cost  function  at  each  step  such  that  the  cost 
function  can  be  minimized  gradually. The  formula  for  a connection  weight,  w.j,  at  k-th 


update  is  given  by 


104 


eq.  3.4-4 


, and 


Aw  (k)  = -5v,£(*) 

Z u 


]]dE 

ldw..{k-  1) 


J 


where  r)  is  a constant,  and  Ew..{k)  is  the  adjustment  amount  of  w . . at  k-th  update.  Since 

ri  is  related  to  the  learning  speed  in  an  adaptation,  it  is  also  called  learning  rate  or  step 
size.  This  adaptation  will  lead  to  a solution  which  yields  E (A)  = 0. 

When  we  apply  a gradient  descent  algorithm  to  train  a network,  there  exist  two 
problems:  slow  convergence  and  local  minima.  The  activation  function  of  a network  PE  is 
a sigmoid,  which  has  two  saturation  regions.  When  a PE  is  working  in  these  saturation 
regions,  the  derivative  of  its  activation  function  will  become  extremely  small,  which  will 
yield  a small  gradient  estimate.  Consequently,  the  adaptation  will  converge  slowly.  The 
existence  of  local  minima  in  the  adaptation  is  because  the  cost  function  is  a nonquadratic 
function  of  connection  weights.  During  training,  the  adaptation  may  be  trapped  in  a local 
minimum. 

A solution  to  overcome  these  problems  is  to  add  a momentum  term  to  the  adjust- 
ment amount  Avr  {k)  in  eq.  3.4-5  [Rum86].  The  modified  formula  is  given  by 


where  y is  a positive  number  and  less  than  1.  The  first  term  on  the  right-hand  side  (RHS) 
is  a momentum.  If  two  consecutive  updates  are  in  the  same  direction,  th.e  adjustment 
amount  in  that  direction  will  be  enhanced.  Therefore,  as  long  as  the  updates  are  kept  in  the 
direction  of  an  optimal  solution,  this  modified  formula  can  s]jeed  up  tlie  convergence.  In 
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addition,  the  use  of  the  momentum  in  an  adaptation  can  also  reduce  the  chance  of  being 
trapped  in  a local  minimum.  This  can  be  explained  by  an  example  shown  in  Figure  3-12. 
The  X axis  represents  a weight,  and  the  y axis  is  the  cost  function  of  the  weight.  This  curve 
is  called  the  performance  curve.  We  note  that  there  are  two  minima  in  the  curve.  Let  us 
assume  the  solution  at  position  #4  is  the  global  minimum,  and  the  solution  at  position  #2  is 
a local  minimum.  Using  a gradient  descend  method  to  adjust  the  weight  is  like  moving  a 
ball  on  this  curve.  The  movement  of  the  ball  is  controlled  by  the  update  formula.  When  the 
adjustment  amount  equals  to  zero,  the  ball  stops.  According  to  eq.  3.4-4,  after  the  ball  rolls 
down  the  hill  from  position  #1  to  position  #2,  it  will  stop  there  because  Aw ..  equals  to 

zero  at  the  point.  The  adaptation  will  be  trapped  in  a local  minimum.  On  the  other  hand,  if 
the  formula  in  eq.  3.4-4  is  employed.  Aw.,  is  not  zero  at  position  #2,  and  the  ball  can  still 

climb  up  to  position  #3  and  eventually  converge  to  the  global  minimum. 


Figure  3-12.  Adaptation  of  a gradient  descend  method  with  momentum. 


Eq.  3.4-5  can  be  regarded  as  a linear  system  whose  pole  is  located  at  y . Since  y 
is  positive  and  less  than  1,  the  system  acts  as  a low-pass  filter.  In  other  words,  the  gradient 
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information  will  be  processed  by  a low-pass  filter  before  being  used  in  the  update  formula. 
From  the  filtering  point  of  view,  using  eq.  3.4-5  will  avoid  an  abrupt  change  in  the  compu- 
tation of  the  adjustment  amount.  Therefore,  random  adjustments  can  be  diminished,  and 
the  adaptation  will  not  suddently  stop.  This  idea  will  be  further  explored  later  when  I pro- 
posed a modified  trajectory  learning  algorithm. 

In  a gradient  descent  procedure,  there  are  two  ways  to  update  a network  parame- 
ters; one  way  is  to  compute  the  gradient  for  each  training  pattern  (a  point  or  a sequence) 
and  update  the  weights  immediately.  This  is  called  incremental  update  mode.  The  other 
way  is  to  compute  the  gradient  for  all  training  patterns,  but  the  weights  are  updated  after 
we  present  all  training  patterns  once  (one  training  epoch).  This  is  called  batch  update 
mode.  No  matter  which  update  model  is  used,  a training  is  usually  conducted  by  repeat- 
edly presenting  all  of  the  training  patterns  to  a network  until  its  performance  is  satisfac- 
tory. In  other  words,  a training  may  continue  for  many  training  epochs.  The  impact  on  the 
performance  in  choosing  an  update  model  is  still  an  open  issue  in  the  neural  network  com- 
munity. However,  the  incremental  update  model  is  usually  preferred  in  a real-time  applica- 
tion. In  my  experiments,  I will  update  network  parameters  using  the  incremental  update 
mode.  In  this  update  model,  an  incremental  gradient  estimate  is  computed  based  on  a cost 
function,  E.,  which  is  contributed  only  by  j-th  training  pattem.This  cost  function  can  be 

expressed  by 


eq.  3.4-7 


To  simpify  the  notation,  I will  drop  the  index  for  the  training  pattern  in  the  following  dis- 
cussion. Therefore,  eq.  3.4-7  can  be  rewritten  as 

S - 2 
/•=  1 


eq.  3.4-8 
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In  the  following  sections,  we  will  discuss  several  learning  algorithms  based  on  the 
gradient  descent  method.  They  include  the  backpropagation  (BP)  algorithm,  backpropaga- 
tion  through  time  (BPTT)  algorithm,  and  real-time  recurrent  learning  (RTRL)  algorithm 
[Rum86][Wer90][Wil89].  The  BP  algorithm  is  a fixed-point  learning  algorithm  for  static 
networks.  This  algorithm  will  be  applied  to  the  training  of  one-step  signal  predictors  in 
Chapter  4.  The  BPTT  algorithm,  RTRL  algorithm  and  their  variations  are  the  most  popular 
trajectory  learning  algorithms  for  dynamic  network.  These  algorithms  will  be  used  in 
training  dynamic  models  in  Chapter  4.  In  the  last  section,  I propose  a modified  version  of 
the  BPTT  algorithm,  which  can  give  the  advantages  of  both  BPTT  and  RTRL  algorithms. 


3.4.1  Backpropagation  Algorithm  for  TDNNs 

In  my  experiments,  I usually  compare  the  performance  of  a dynamic  network  and  a 
one-step  signal  predictor.  I choose  two-layer  TDNNs  as  the  model  of  one-step  predictors. 
The  networks  are  trained  by  using  the  BP  algorithm  [Wer90][Rum86].  In  this  section,  the 
BP  algorithm  for  a two-layer  TDNN  will  be  derived. 

Since  we  train  the  network  as  a one-step  predictor,  /'  in  eq.  3.4-7  is  equal  to  1. 


Thus,  the  cost  function  becomes 


1 

2 


(x-x) 


eq.  3.4. 1-1 


Where  x is  the  output  of  the  network.  Let  us  denote  the  activation  of  the  p-th  PE  as  x 


which  is  computed  by 


X 

p 


( \ 


W X 
pq 

J 


= a (nei  ) 

p ' p^ 


eq.  3.4.1-2 


Assume  = x.  Then  the  gradient  with  respect  to  a weight  n , which  connects  j-th  PE  to 
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i-th  PE,  can  be  computed  by 

dE 

= -e 

8w .. 

V 


eq.  3.4. 1-3 


If  a weight  between  a hidden  PE  and  the  output  unit,  i.e.  the  subscript  / - 0,  eq. 
3.4. 1-3  will  yield 


= -ecs\{net^)x.  = -b^x. 


eq.  3.4. 1-4 


where  o'q  {net^  is  the  derivative  of  the  activation  function  of  the  output  unit  w.r.t.  netQ. 
If  w ..  is  a weight  connecting  a input  unit  and  a hidden  PE,  we  need  to  apply  the  chain  rule 
in  calculus  to  eq.  3. 4. 1-3, 


dx 


-e 


0 


-e 


dw .. 

V 

dx^dx. 


dx.dw.. 
I u 


= -eWp .o'q  (neto)  &.  (net.)  x 

= -5.x. 

» j 


eq.  3.4. 1-5 


According  to  eq.  3. 4. 1-4  and  3.4. 1-5,  the  formulae  to  compute  the  gradient  can  be 
arranged  in  the  same  form. 


eo 


0 


Wo-SoC. 


i = 0 
otherwise 


eq.  3.4. 1-6 


This  formula  suggests  a backward  propagation  of  the  error  through  the  computation  of 
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5 .’s.  Thus,  the  adaptation  procedure  of  this  learning  algorithm  is  as  follows:  we  first  com- 
pute the  activations  of  the  network  PEs  layer  by  layer  in  forward  path.  The  error  between 
the  target  output  and  the  network  output  is  computed.  Then,  this  error  is  propagated  back- 
ward to  compute  the  gradients  according  to  eq.  3.4. 1-6.  Therefore,  this  algorithm  is  named 
the  backpropagation  algorithm. 

The  BP  algorithm  has  been  shown  very  successful  in  training  static  networks.  But, 
we  can  not  use  this  algorithm  directly  to  train  a dynamic  network.  This  is  because  the  acti- 
vations of  a static  network  do  not  contain  the  information  of  the  activations  in  the  previous 

dx. 

time  steps.  Therefore,  the  computation  of  the  derivatives,  — is  very  straightforward. 

dw .. 
u 

However,  there  is  a method  proposed  by  Werbos  to  convert  a dynamic  network  into  a 
static  network  such  that  we  can  still  use  the  BP  algorithm  to  train  the  resulting  static  net- 
work. This  is  the  BPTT  algorithm. 

3.4.2  Backpropagation  Through  Time  Algorithm 

In  dynamic  modeling,  the  target  pattern  is  a sequence.  The  cost  function  is  given  in 
eq.  3.4-8.  To  avoid  a confusion  in  the  notation  and  introduce  the  temporal  index,  I rewrite 
eq.  3.4-8  as 

T T 

1 2 1 2 

eq.3.4.2-1 

t=\  t=\ 

where  d (/)  and  d (t)  are  data  samples  in  the  training  sequence  and  the  network  output  at 
time  step  t respectively,  and  T is  the  length  of  the  training  sequence.  The  cost  function  in 
this  form  will  be  used  in  all  discussions  of  the  trajectory  learning  algorithms.  And,  to  gen- 
eralize the  discussion,  let  us  assume  we  have  a fully-connected  recurrent  network  of  N 
PEs.  We  can  designate  any  one  of  the  PEs  as  the  output  unit. 
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The  main  idea  of  the  BPTT  algorithm  is  to  unfold  a fully  connected  recurrent  net- 
work over  time  to  obtain  an  equivalent  MLP  structure.  An  e.xample  of  this  unfolding  pro- 
cess is  given  in  Figure  3-13.  Each  layer  of  this  MLP  represents  the  activations  of  the 
network  at  a specific  time  step.  The  weight  connecting  two  PEs  in  this  equivalent  network 
is  a copy  of  the  original  weight.  Each  copy  of  the  weight  is  treated  as  an  independent  vari- 
able. The  error  gradient  w.r.t.  a certain  weight  in  the  original  network  can  be  computed  by 
summing  up  the  error  gradients  w.r.t.  each  copy  of  this  weight  in  the  corresponding  MLP. 

This  computation  can  be  expressed  as  follows 

T T 


dw 


= E 


u 


t=  1 




dw.j  (0 


= Z 


dE 


d 


t=  1 


dx.  (/)  dw.j  (/) 


v,(0 


eq.  3. 4. 2-2 


where  x.  (t)  is  the  activation  of  i-th  PE  at  time  step  t,  and  vr  . (/)  is  the  copy  of  h' . at  time 
step  t.  Since  w ..  (/)  will  appear  only  in  the  formula  to  compute  x.  (i) , when  we  apply  the 
chain  rule  to  compute  dE/ dw  ..{t)  we  will  obtain  the  expression  on  the  RHS  of  the  eq. 
3. 4. 2-2.  The  activation  of  the  i-th  PE  at  time  step  t is  computed  by 


(0  = o,.(ajcL(0)  = a. 


k k 


eq.  3. 4.2-4 


Consequently,  we  have 


d 

dw.j  (0 


x.(0  = o'. (net. (t))  -x-it- 


eq.  3. 4. 2-5 


Ill 


Figure  3-13.  Time-unfolding  of  a fully-connected  recurrent  network. 


We  define  the  backpropagation  error  d.  (t)  = 


dE 


dx.  (0 


, and  it  can  be  computed  by 


where 


5.(0  = 


dE 


d 


dx.  (0  dx.  (0 
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eq.  3. 4. 2-6 
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According  to  eq.  3.4.2-5  and  3.4.2-6,  we  can  rewrite  eq.  3.4.2-2  as  follows 


Z 

t=  1 


dE 


dx.(t)  dw .. 

' V 


X.  (0 


T 

= ^ 5 . (0  o'  (net.  (0 ) • (/  - 1 ) 

/ = 1 


eq.  3.4. 2-7 


This  formula  is  actually  the  backpropagation  algorithm  for  the  equivalent  MLR  Since  the 
MLR  is  a time-unfolded  version  of  the  original  recurrent  network,  this  algorithm  is  named 
backprogation  through  time.  Because  the  backpropagtion  error  5^.  is  defined  differently  in 

the  BR  algorithm  and  the  BRTT  algorithm,  eq.  3.4. 1-6  and  3. 4.2-7  do  not  have  exactly  the 
same  form.  But  the  error  can  be  propagated  backward  in  the  same  way  in  both  algorithms. 
A continuous-time  version  of  the  BRTT  algorithm  was  proposed  by  Rearlmutter  [Rea89]. 

In  the  BRTT  algorithm,  the  weights  can  be  updated  only  after  the  last  sample  of  a 
training  sequence  is  presented.  Therefore,  we  need  to  store  all  of  the  activations  in  the 
time  interval  [1,T].  The  required  memory  size  will  increase  with  the  length  of  the  training 
sequence.  In  other  words,  the  space  complexity  of  the  BRTT  algorithm  is  (5(NT).  To  ease 
this  requirement,  Williams  proposed  to  limit  the  backpropagation  pass  to  a fixed  number 
of  time  steps  K [Wil89].  The  error  gradient  then  becomes 

T 

^ 2 5,.  (0  o'  (net. (0 ) -x.(t-\)  eq.  3. 4.2-8 

t=T-K+\ 

This  is  called  truncated  backpropagation  through  time  or  BRTT(K)  algorithm.  Note  that 
the  BRTT(K)  does  not  compute  the  exact  error  gradient  when  K is  smaller  than  T.  How 
well  this  approximation  works  depends  on  how  well  the  last  K elements  of  the  gradient 
estimates  can  represent  the  true  value. 

In  the  BRTT  algorithm,  we  store  all  of  the  network  activations  until  the  last  error  is 
computed.  Then,  before  the  next  training  sequence  is  presented,  we  need  to  compute  the 
gradients  and  update  all  of  the  weights.  The  time  complexity  of  the  BRTT  algorithm  is 
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(5(N^T).  In  real-time  applications,  we  usually  tiy  to  avoid  focusing  all  intensive  computa- 
tions in  a specific  time  period,  and  a recursive  training  algorithm  is  thus  preferred.  Real- 
Time  Recurrent  Learning  (RTRL)  is  an  algorithm  developed  out  of  this  requirement. 

3.4.3  Real-Time  Recurrent  Learning  Algorithm 

Using  the  RTRL  algorithm  to  train  a network  to  learn  a sequence,  we  compute  the 
incremental  gradient  estimate  at  each  time  step  using  a recursive  formula.  Thus,  the  gradi- 
ent computation  can  be  performed  only  forward  in  time.  To  derive  tlie  RTRL  algorithm, 
we  start  with  the  cost  function  given  in  eq.3.4.2-1,  and  we  compute  its  gradient  as  follows 


eq.  3. 4. 3-1 


where  x^(t)  is  the  output  of  the  network,  and  dJ'^  (I)  / dw  is  the  incremental  gradient 

estimate  at  time  step  t.  We  note  that  instead  of  decomposing  each  weight  into  several  tem- 
poral copies  (as  we  did  in  the  BPTT  algorithm)  we  decompose  the  cost  function.  Accord- 
ing to  eq.  3.4. 2-4,  we  can  compute  the  derivative  of  the  activation  x.  (/)  w.r.t.  the  weight 

w as  follows, 

pq  ’ 


dw 


-XfiO  = 


P‘1 


d 


udw  J 

P‘l 


eq.  3. 4. 3-2 


where  5.  is  the  Kronecker  function,  i.e.  5.^  equals  to  1 if  / = p and  equals  to  0 other- 
wise. This  is  a recursive  formula  to  compute  the  derivative r?x^.(/)  which  can  be 

plugged  into  eq.  3.4. 3-1  to  estimate  dE(t)  at  each  time  step.  Therefore,  using  eq. 


3.4. 3-1  and  eq.  3.4. 3-2,  we  are  able  to  calculate  the  incremental  gradient  estimate  at  each 


114 


time  step  in  a recursive  manner.  After  we  present  the  whole  training  sequence,  we  just  sum 
up  these  incremental  gradient  estimates  and  update  the  network  parameters  accordingly. 
The  backward  propagation  of  the  errors  is  not  necessary  in  this  algorithm. 

Athough  different  approaches  are  taken  to  compute  the  gradient,  both  BPTT  and 
this  recursive  algorithms  will  yield  the  same  gradient  estimate.  For  real-time  applications, 
Williams  proposed  to  update  the  weight  whenever  the  incremental  gradient  dE  (t)  / 

is  available.  The  resulting  algorithm  is  called  the  Real-Time  Recurrent  Learning  algorithm 
[Wil89].  In  the  derivations  of  eq.  3.4.3-1  and  eq.  3.4.3-2,  we  assume  the  weight  is 

fixed.  Therefore,  for  a good  approximation  of  the  gradient  estimate  a small  step  size  is 
usually  used  in  the  RTRL  algorithm. 

Since  all  computations  in  the  RTRL  algorithm  are  forward  in  time,  it  is  not  neces- 
sary to  store  all  of  the  previous  activations  except  the  last  ones.  The  required  memory  size 
is  fixed  and  will  not  increase  with  the  length  of  the  training  sequence.  This  feature  is  cer- 
tainly desirable  when  the  training  sequence  becomes  very  long.  We  can  analyze  the  com- 
plexity of  the  RTRL  algorithm  as  follows:  for  a fully  connected  recurrent  network  of  N 

PEs,  there  are  connection  weights.  According  to  eq.  3. 4. 3-2,  there  are  N derivatives  to 
be  maintained  for  each  weight.  Therefore,  the  space  complexity  of  the  RTRL  algorithm  is 

(5(N^).  To  update  each  derivative  requires  N operations.  Thus  the  time  complexity  of  the 

algoritm  at  each  time  step  is  The  complexity  analysis  indicates  that  the  RTRL  algo- 

rithm requires  tremendous  computation  resources  in  training  a large  network. 

Another  problem  of  the  RTRL  algorithm  may  occur  when  we  apply  the  algorithm 
to  train  a network  with  a long  sequence.  As  we  mentioned  before,  to  obtain  a good  approx- 
imation of  the  gradient  estimate  the  setp  size  for  each  incremental  update  has  to  be  small 
in  the  RTRL  algorithm.  In  other  words,  the  adjustment  amount  of  each  weight  becomes 
small  after  first  few  iterations.  Since  the  adaptation  to  the  d.c.  component  of  the  sequence 
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is  usually  very  quick,  the  network  will  develop  a fix-point  dynamics  at  the  begining  of  the 
training.  Because  the  adjustment  of  each  weight  is  so  small  afterward,  the  adaptation  may 
not  be  able  to  drive  the  dynamics  of  the  network  through  bifurcation  points  [Wil89],  i.e.  to 
yield  a network  to  generate  oscillations.  Thus,  the  training  will  end  up  Just  changing  the 
network  dynamics  from  one  fixed  point  to  another.  Consequently,  the  network  may  never 
learn  the  dynamics  of  the  sequence. 

To  solve  this  problem,  Williams  and  Zipser  suggested  to  introduce  a correction 
called  teacher  forcing  into  the  training  [Wil89].  This  is  implemented  as  follows:  every  sev- 
eral iterations  the  actual  output  of  the  network  is  replaced  with  its  target  signal.  The  target 
signal  instead  of  the  actual  output  is  fed  back  in  the  recurrent  loops  of  the  network.  In  this 
manner,  we  force  the  training  into  a transient  such  that  the  adjustments  of  weights  become 
large  enough  to  bring  in  more  possible  dynamics.  This  may  improve  the  learning  in  terms 
of  the  final  error.  However,  the  network  may  not  work  as  we  wish  after  the  training.  This 
is  due  to  the  absence  of  the  teacher  force  in  the  opearting  mode  of  the  network.  A possible 
solution  to  this  problem  is  to  remove  the  teacher  force  from  the  training  when  the  error 
drops  to  a certain  level  and  continue  the  training  for  a while  [Wil89]. 

To  summarize,  the  RTRL  algorithm  requires  tremendous  computational  resources 
when  the  size  of  the  network  is  large,  and  the  algorithm  may  need  some  extra  procedures, 
e.g.  teach  forcing,  to  develop  more  possible  dynamics  in  neural  networks.  The  biggest 
advantage  of  the  RTRL  algorithm  is  that  the  required  memory  size  remains  constant. 

To  compare  the  RTRL  and  the  BPTT  algorithm,  I list  the  complexities  for  both 
algorithms  in  Table  3-1  for  the  training  of  a fully  connected  recurrent  network  of  N PEs  to 
learn  a T-sample  long  sequence.  We  note  that  the  time  complexity  of  the  RTRL  algorithm 

is  because  the  complexity  is  at  each  time  step.  From  Table  3-1,  we  may  con- 

clude that  the  RTRL  has  the  advantage  of  fixed  memory  size,  while  the  BPTT  is  more 
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computational  efficient.  In  the  next  section,  I propose  a modified  BPTT  algorithm  that  can 
give  us  both  advantages. 


Table  3-1.  Complexities  of  trajectory  learning  algorithms. 


Time 

Complexity 

Space 

Compexity 

BPTT 

(5(N^T) 

6i(NT) 

RTRL 

(5(N'‘T) 

(9(N^) 

BPTT(K) 

(5(N^K) 

(5(NK) 

Modified 

BPTT 

(5(N^T+NKT) 

(5(NK+KT) 

3 4 4 BPTT  with  Fixed  Memory  Requirement 

In  the  BPTT  algorithm,  we  need  to  store  the  activations  of  each  individual  PE  at  all 
time  steps  for  the  computation  of  the  backpropagation  errors  and  the  gradients.  This  can 
be  implemented  by  clocking  the  activation  of  a PE  at  each  time  step  into  a delay  line 
whose  order  is  equal  to  T-1,  where  T is  the  length  of  the  training  sequence.  If  a network 
has  N PEs,  N sets  of  this  delay  lines  will  be  needed  to  maintain  the  activations.  The 
required  memory  size  may  become  impractically  large  if  we  have  long  training  sequences. 

To  decouple  the  required  memory  space  and  the  length  of  training  sequences.  We 
propose  to  replace  delay  lines  with  gamma  memories  to  store  the  activations  [Pri93d].  As 
we  mentioned  before,  the  effective  memory  depth  of  a gamma  memory  is  defined  as  K/p, 
where  K is  the  order  of  the  memory  and  p is  the  parameter  of  the  memory.  If  p is  selected 
less  than  1,  a small  order  gamma  memory  can  retain  the  information  in  the  same  temporal 
window  as  a high  order  delay  line.  Since  the  required  temporal  window  for  the  storage  of 

the  activations  is  T,  we  should  keep  the  ratio  K/p  equal  to  T.  When  the  length  of  training 
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sequences  increases,  we  can  just  decrease  the  value  of  \x  to  extend  the  memory  depth  and 

keep  K as  a constant.  Thus,  the  memory  size  to  store  the  activations  can  be  fixed. 

This  modification  can  be  regarded  as  a process  to  project  a T-dimensional  vector, 

representing  the  activations  of  a PE  for  T time  steps,  into  a K-dimensional  memory  space. 

This  projection  operation  can  be  expressed  in  a matrix  form, 

X = AY  eq.  3.4. 4-1 


Xo(7) 

y(T) 

where  X = 

• • • 

Y = 

/ 

• • • 

T(2) 

x^(T) 

x(i)_ 

A = 

a.. 

f 


and 


The  i-th  element  of  the  vector  X,  x.(T) , is  the  content  of  the  gamma  memory  at  i-th  tap 

after  we  clocked  in  T data  samples,  and  y(t)  is  the  t-th  data  sample  of  the  input 
sequence.  In  our  case,  y(t)  represents  a sequence  of  activations.  A is  a (K+l)-by-T  matrix 
whose  entry,  , is  the  data  sample  of  the  impluse  response  at  the  output  of  the  i-th  tap  at 

time  step  j.  An  example  of  the  matrix  A for  T=6  and  K=3  is  given  by 


1 0 0 0 0 0 
0 p p ( 1 - l-i)  P ( 1 - P)  ^ |t  ( 1 - |t)  ^ p ( 1 - p) 
00  p^  2p“(l -p)  3p^(l-p)“  4p^(l -p)^ 

0 0 0 p^  3p^(l-p)  6p^(l-p)^ 


eq.  3. 4. 4-2 


In  the  computation  of  backpropagation  errors  and  the  estimation  of  gradients,  we 
need  to  retrieve  the  activations  from  the  memory.  If  we  select  K+1  equal  to  T,  A becomes 
a sqaure  matrix.  Since  A is  nonsingular,  Y can  be  uniquely  determined  by  X.  The  activa- 
tions can  be  computed  by 
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y = A~^X  eq.  3.4.4-3 

where  A~^  is  the  inverse  matrix  of  A.  However,  our  purpose  is  to  reduce  the  required 
memory  space.  We  usually  choose  K+1  much  smaller  than  T.  Consequently,  eq.  3. 4.4-1 
represents  an  underdetermined  linear  system.  The  best  estimate  of  Y in  the  sense  of  fr2 

norm  is 

f = A^X  eq.  3. 4.4-4 

where  is  the  Moore-Penrose  pseudoinverse  of  A.  Since  all  of  the  gamma  memories 
that  we  utilize  to  store  the  activations  have  the  same  order,  we  need  to  store  only  one 
pseudo-inverse  matrix.  Whenever  we  need  to  retrieve  a sequence  of  activations,  we  can 

directly  apply  eq.  3. 4. 4-4  to  obtain  the  estimates. 

Using  this  modified  BPTT  algorithm,  the  memory  size  required  to  store  the  net- 
work activations  can  be  reduced  from  NT  down  to  NK.  Besides  the  introduction  of  the 
errors  in  estimating  the  activations,  another  price  that  we  pay  for  the  reduction  of  the 
required  memory  size  is  additional  operations  to  retrieve  the  activations.  The  number  of 
these  operations  is  NK(T-l).  Thus,  the  time  complexity  of  this  modified  BPTT  algorithm 

is  (5(N^T+NKT).  Although  we  use  the  same  memory  size  to  store  the  activations,  this 

modified  algorithm  is  different  from  the  truncated  BPTT  (BPTT(K))  algorithm  because  we 
do  not  just  maintain  the  last  K activations.  In  a temporal  pattern  classification  problem, 
this  modified  BPTT  algorithm  with  p = 0.1667  and  K=  pT  has  been  shown  to  be  able  to 
perform  better  than  the  truncated  BPTT  algorithm  (BPTT(K)),  and  work  as  well  as  the 
BPTT  algorithm  in  terms  of  the  convergence  speed  and  the  final  mean  square  error 
[Pri93d]. 


119 


Although  the  memory  size  to  store  the  activations  can  be  reduced,  we  note  that 
using  this  modified  BPTT  algorithm  we  also  need  some  memory  space  to  store  the  entries 
of  the  pseudo-inverse  matrix.  Since  the  entries  in  the  first  row  and  the  first  column  of  the 

pseudo-inverse  matrix  are  all  zeros  except  a|j  = 1 , the  required  memory  size  to  store  the 

pseudo-inverse  matrix  is  K(T-l).  Therefore,  the  total  memory  size  required  for  this  algo- 
rithm is  NK  + K(T-l),  compared  to  NT  required  for  the  conventional  BPTT  algorithm. 
Thus,  the  space  complexity  of  the  modified  BPTT  algorithm  is  0(NK+KT).  To  actually 
reduce  the  required  memory  size,  the  minimal  length  of  the  training  sequences  can  be 
computed  by 

T>K(N-\)/ (N-K)  eq.  3.4.4-S 

Here,  we  assume  K < N.  With  the  selection  of  a small  K,  this  condition  can  be  easily  satis- 
fied even  for  a small  T. 

When  we  clock  a sequence  into  a gamma  memory,  we  can  observe  that  the  infor- 
mation of  the  first  few  data  samples  in  the  sequence  is  more  significant  in  the  later  mem- 
ory taps  than  in  the  first  few  taps.  Therefore,  to  estimate  the  elements  of  Y may  not  need 
all  of  the  elements  of  X.  This  can  be  seen  from  the  following  example.  Assume  we  clock 
in  T data  samples  into  a K-th  order  gamma  memory  and  a (T-l)-th  order  delay  line.  After- 
wards, we  use  optimal  linear  estimators  to  estimate  the  original  data  samples  based  on  the 
contents  of  the  gamma  memory.  A schematic  diagram  of  this  estimation  is  illustrated  in 

Figure  3-14.  In  this  example,  I choose  T=24,  K=12,  and  p=0.5.  The  estimate  for  y(i)  can 
be  expressed  by 
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j = 


K 
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K-p  + \ 
T-i 


w .x.(T) 

IJ  ^ 


Hi)  = 


I % 


T-i 


\ 


I 

/= 1 


w ..x.(T) 


T-i>K 


eq.  3. 4. 4-6 


p<T-i<K 


- 1 <p 


where  p is  the  order  of  the  estimator,  and  vv ..  are  the  coefficients  of  the  estimator.  We 

note  that  the  estimations  are  always  based  on  the  last  p memoi'y  taps  that  contain  the  to-be- 
estimated  data  sample.  In  Figure  3-15, 1 show  the  estimation  errors  when  the  estimators  of 
different  orders  are  used.  We  note  that  the  errors  in  estimating  original  data  samples  can 
not  be  improved  significantly  after  we  increase  the  order  of  the  estimators  over  4.  When  p 
is  increased  to  12,  the  outputs  of  these  estimators  are  the  same  as  the  results  that  we  can 
obtain  from  eq.  3. 4. 4-3.  Therefore,  it  is  actually  not  necessary  to  store  all  of  the  entries  of 

the  pseudo-inverse  matrix.  According  to  my  experimental  results,  when  ).i  is  set  to  0.5  the 

estimators  of  order  equal  to  K/3  usually  can  perform  as  well  as  the  estimation  using  A"^. 
Thus,  in  this  case  the  memory  space  to  store  the  coefficients  of  the  estimators  can  be 
reduced  to  one  third  (from  K(T-l)  to  K(T-l)/3).  In  addition,  the  number  of  operations 
required  to  estimate  the  activations  can  also  be  reduced  to  one  third.  If  a smaller  p,  is 
selected,  the  order  of  the  estimators  can  be  further  reduced.  However,  the  estimation  error 


will  increase. 
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( To  train  a recurrent  network  to  generate  a sinewave  has  been  used  as  a benchmark 

task  to  test  a trajectory  learning  algorithm.  In  the  following  experiment,  I use  the  BPTT 
algorthm,  the  modified  BPTT  algorithm,  and  the  truncated  BPTT  algorithm  to  train  a 
fully-connected  recurrent  network  of  7 PEs  to  learn  a sinewave.  The  frequency  of  the  sig- 
nal is  20  Hz,  and  the  signal  was  sampled  at  500  Hz.  Therefore,  there  are  25  data  samples 
per  cycle.  I choose  this  number  as  the  length  of  training  sequences.  The  training  sequences 
are  prepared  every  10  data  samples  from  the  signal.  The  signal  has  totally  300  data  sam- 
ples. 

For  the  modified  BPTT  algorithm,  I select  K=12  and  p = 0.5.  According  to  Figure 
3-15,  we  know  that  3-rd  order  (having  4 taps)  estimators  can  estimate  the  activations 
almost  as  well  as  12-th  order  estimators  (or  using  pseudo-inverse).  Therefore,  I just  store 
the  entries  of  a 4 by  23  matrix  for  the  retrieval  of  activations.  And,  I choose  K equal  to  13 
for  the  truncated  BPTT  algorithm. 

The  learning  curves  are  given  in  Figure  3-16.  We  find  that  the  modified  BPTT 
algorithm  converges  faster  than  the  BPTT  algorithm,  and  the  training  using  the  truncated 
BPTT  algorithm  does  not  converge  in  600  training  epochs.  After  training,  I iterate  a 
sequence  of  200  samples  from  both  networks  trained  by  using  the  BPTT  algorithm  and  the 
modified  BPTT  algorithm.  They  are  shown  in  Figure  3-17.  We  note  that  the  network 
trained  by  using  the  modified  BPTT  algorithm  performs  a litter  bit  better  due  to  a smaller 
final  mean  square  error  in  training. 

These  results  show  that  in  this  task  the  modified  BPTT  algorithm  can  yield  a reli- 
able gradient  estimation  during  training  when  we  reduce  about  one  half  the  memory  space 
to  store  activations.  The  truncated  BPTT  algorithm  fails  to  converge  before  we  stopped  the 
trianing.  Because  the  algorithm  just  estimates  the  gradients  for  the  last  K iterations,  the 
adapatation  may  not  move  towards  an  opitmal  solution  most  of  times. 
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Although  the  modified  BPTT  algorithm  performs  better  than  the  BPTT  algorithm 
in  this  experiment,  we  should  note  that  we  still  try  to  use  the  modified  algorithm  to  com- 
pute the  gradient  as  close  as  possible  to  the  estimate  of  the  BPTT  algorithm.  One  possible 
reason  why  the  modified  algorithm  works  better  is  that  the  frequency  of  the  target  signal 
is  low  compared  to  the  sampling  rate.  Usually,  the  network  activations  will  synchronize 
with  this  frequency  if  its  PEs  are  working  in  their  linear  regions.  When  we  store  these  acti- 
vations in  a low-pass  filter  memory  like  gamma  memories,  we  will  not  distort  the  activa- 
tions severely.  Therefore,  we  still  can  obtain  a very  good  gradient  estimated  from  the 
retrieved  activations.  In  addition,  as  we  mentioned  before  a low-pass  filtered  gradient 
estimate  will  introduce  a momentum  to  the  adaptation.  Thus,  a faster  convergence  can  be 
obtained. 
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Figure  3-14.  Schematic  diagram  in  estimating  original  data  samples 
based  on  the  contents  of  a gamma  memory. 


Figure  3-15.  Errors  in  estimating  original  data  samples  with 
estimators  of  different  orders. 


Figure  3-16.  Learning  curves  for  different  learning  algorithms. 


(a) 


(c) 


Figure  3-17.  Results  of  (a)  the  BPTT  algorithm,  (b)  the  modified  BPTT 
algorithm,  and  (c)  the  original  test  signal. 


CHAPTER  4 

EXPERIMENTAL  RESULTS 


It  has  been  shown  that  by  changing  only  one  of  the  parameters,  a simple  recurrent 
neural  network  can  exhibit  a variety  of  dynamical  behaviors  [Cha92],  However,  when  we 
adapt  the  parameters  of  a network  to  learn  a prescribed  dynamics,  the  resulting  model  may 
become  highly  sensitive  to  the  modeling  error  due  to  a large  gain.  In  other  words,  after 
training,  the  network  will  not  be  able  to  sustain  the  desired  dynamics.  Therefore,  the  range 
of  possible  network  dynamics  that  we  can  obtain  through  training  will  be  quite  limited.  In 
the  first  section,  I will  illustrate  this  problem  by  examining  how  a 2-PE  recurrent  network 
learns  to  produce  a given  sinewave.  The  method  that  I devised  in  training  this  2-PE  net- 
work not  only  gives  us  more  insight  about  the  role  of  each  parameter  in  the  task  but  also 
results  in  a much  faster  convergence  in  training. 

In  section  4.2,  I construct  two  dynamic  models  of  a highly  nonlinear  signal  using 
two  different  adaptation  schemes,  which  are  respectively  equivalent  to  the  equation  error 
and  the  output  error  formulations  in  linear  ARMA  modeling.  The  model  trained  with  a 
global  feedback  (output  error  scheme)  is  shown  to  be  able  to  capture  the  dynamics  of  the 
signal;  while  the  other  model  trained  as  a one-step  predictor  (equation  output  scheme)  can 
not  sustain  the  learned  dynamics  for  long.  This  is  interpreted  as  a deficient  dynamic  model 
of  the  system  that  produced  the  time  series. 

The  results  of  the  dynamic  modeling  for  chaotic  signals  are  given  in  section  4.3. 
Two  chaotic  times  series  are  studied.  They  are  the  Mackey-Glass  (mg30)  signal  and  the 
Lorenz  signal.  The  former  is  a chaotic  time  series  having  relatively  small  positive 
Lyapunov  exponents  (0.0071,  0.0027  nats/sec),  while  the  later  contains  a large  positive 
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Lyapunov  exponent  (2.16  bits/sec).  The  criteria  to  determine  the  length  of  the  training 
sequences  and  the  model  validation  methods  proposed  in  Chapter  2 are  applied  to  the 
experiments. 

In  Chapter  2,  we  discussed  a projection  method  to  remove  high-dimensional  noise 
in  state  space,  and  showed  that  signals  were  still  contaminated  by  the  noise  in  the  projec- 
tion space.  In  section  4.4, 1 propose  to  train  a dynamic  artificial  neural  network  as  a state 
predictor,  having  an  gamma  memory  filter  at  each  input  unit.  Based  upon  a special  feature 
of  the  gamma  memory,  a criterion  is  proposed  to  stop  the  training  of  the  network  at  the 
point  of  the  highest  signal-to-noise  ratio,  i.e.  to  exclude  the  learning  of  the  noise  dynam- 
ics. Consequently,  the  noise  in  the  signal  can  be  further  removed. 

4 1 Modeling  Sinewaves  with  a 2-PE  Network 

Training  a network  to  produce  a desired  sinusoidal  signal  has  been  used  as  a 
benchmark  for  different  trajectory  learning  algorithms[Wil89][Pea89][Tsu90].  This  task 
can  be  considered  as  the  dynamic  modeling  of  a limit  cycle.  In  linear  systems,  we  can  eas- 
ily design  a second  order  HR  filter  which  acts  as  an  oscillator.  However,  to  the  best  of  my 
knowledge,  there  is  no  report  in  the  literature  regarding  the  successful  generation,  through 
learning,  of  a desired  sinewave  by  an  autonomous  two-neuron  network.  For  example, 
some  research  group  reported  a two-neuron  network  whose  output,  after  training,  is  more 
like  a quasi-periodic  signal  than  a sinusoidal  waveform  [Wil89][Tsu90],  while  the  other 
employed  a network  of  larger  size  to  accomplish  the  task  [Pea89].  In  either  case,  a long 
data  segment  and  many  training  epochs  are  required.  Therefore,  the  first  question  that  I try 
to  answer  from  this  experiment  is  “are  we  able  to  train  a two-neuron  network  to  model  a 
simple  limit  cycle?”. 

Another  motivation  of  this  experiment  is  to  understand  what  is  the  role  of  each 
parameter  in  a 2-PE  network.  This  understanding  may  lead  to  parsimonious  neural  net- 
work architectures  and  also  to  ways  to  faster  train  recurrent  network.  Because  of  its  mas- 


127 


sive  connections,  it  is  very  difficult  to  identify  the  function  of  each  network  parameter  in 
solving  a task  unless  the  size  of  the  network  is  small.  Therefore,  if  we  can  train  a two-neu- 
ron network  to  model  a desired  sinewave  successfully,  we  are  able  to  understand  the  func- 
tions of  the  network  parameters. 

In  this  experiment,  I propose  a method  to  train  a two-neuron  recurrent  network. 
The  parameters  of  this  network  are  divided  into  two  groups.  One  group  of  the  parameters 
can  be  determined  through  analysis.  The  other  group  is  determined  adaptively.  The  exper- 
imental results  show  that  this  two-neuron  network  can  indeed  be  trained  to  produce  a 
desired  sinewave  by  using  the  proposed  method.  In  addition,  this  method  also  gives  a dra- 
matic improvement  in  the  training  in  terms  of  the  required  data  samples  and  the  learning 
speed.  More  important  is  through  the  analysis,  we  are  able  to  identify  the  role  of  each 
parameter  in  modeling  a sinewave. 

4 11  Analysis  for  Sinewave  Generation 

A two-PE,  fully  connected  network  is  shown  in  Figure  4-1.  The  parameters  to  be 
determined  include  the  gains  of  the  sigmoidal  function  ( P^.  ),  the  connection  weights  ( 
w ),  and  the  biases  (0^. ).  The  core  idea  of  my  proposed  method  is  to  divide  the  network 
into  two  parts  and  identify  the  role  of  each  PE.  On  the  left-hand  side,  the  dynamics  of  PE 
#1  is  governed  by 

5j(w)  = Oj(p,(Wjj5^(w- 1) +^2(«- 1) +0i))  =o,(ne/j(n))  eq.  4.1. 1-1 

where  Si(n)  and  S2(n)  are  the  output  and  the  input  of  PE  #1  respectively,  and  Oj  ( ) is  a 
sigmoidal  activation  function.  Since  the  self-recurrent  feedback  and  the  input  of  PE#1  can 
be  weighted  differently  by  adjusting  Wjj,  we  can  set  the  weight  Wjj  to  1 without  loss  of 
generality.  Therefore,  eq  4. 1.1-1  can  be  rewritten  as 

5i(«)  = Oj(pj(5,(«-l)  +S^(n-\)  +0j)) 


eq  4. 1.1-2 
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Figure  4-1.  A two-neuron  network. 

Since  our  task  is  to  generate  a target  sinewave,  we  assume  that  Si(n)  is  known  at 
all  times.  The  required  input  signal  S2(n)  can  be  determined  by  the  following  procedure. 

1.  Choose  Pj  and  82(0). 

2.  Compute  0j 

01  = eq.  4.1. 1-3 

3.  Compute  the  sequence  S2(n)  iteratively 

S^{n)  = ^0“’ (5,  («+ 1)) («) -0j  eq.  4.1. 1-4 

From  eq.  4. 1 . 1-3  and  eq.  4. 1 . 1-4,  we  note  that  the  shape  of  S2(n)  is  uniquely  deter- 
mined by  the  selection  of  pj . The  selection  of  82(0)  affects  only  the  dc  component  of  the 
resulting  sequence  S2(n).  We  also  note  that  once  we  determine  Pj,S2(0),  and  0j,  we  can 
implement  a circuit  to  compute  S2(n)  according  to  eq.  4. 1.1-4.  After  we  compute  the 
sequence  S2(n),  the  parameters  on  the  right-hand  side  of  the  network  in  Figure  4-1  can  be 
determined  adaptively.  The  sequence  Sj(n)  becomes  the  input  signal,  and  S2(n)  is  the  tar- 
get signal.  Any  trajectory  learning  algorithm  (real  time  recurrent  learning  (RTRL)  or  back- 
propagation  through  time  (BPTT) ) can  be  applied  to  train  the  weights  wj2,  W21,  W22,  and 
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02,  P2-  The  advantage  of  this  proposed  method  is  that  the  size  of  the  training  problem  is 
reduced.  The  adaptation  of  an  autonomous  two-PE  network  is  converted  into  the  adapta- 
tion of  a single  PE  with  external  input.  As  long  as  the  external  input  and  the  target  signal 
have  similar  waveforms,  this  approach  will  ease  the  adaptation  a lot. 

4 1 2 Some  Consideration  in  the  Proposed  Method 

Since  there  are  only  two  PEs,  the  number  of  mappings  that  can  be  modeled  by  the 
network  through  learning  is  very  limited.  To  ease  the  adaptation  of  the  right-hand  side  of 
the  network,  we  should  tune  the  gain  of  the  activation  function  of  PE#1  to  a value  such 
that  the  waveforms  of  Sj(n)  and  S2(n)  are  similar.  Later,  in  the  derivation  of  a linear 
approximation  of  the  relationship  in  eq.  4. 1.1-4,  we  will  find  that  this  requirement  can  be 
achieved  easily  as  long  as  the  PE#1  is  designed  to  work  in  the  “pseudo”  linear  region  of  its 
activation  function. 

Therefore,  the  main  task  of  the  right-hand  side  of  the  netH'ork  is  to  yield  a neces- 
sary pha.se  shift  between  its  input  (Sfn))  and  its  output  (S2(n)).  We  note  that  the  input  of  a 
PE  always  leads  its  output  in  phase  if  we  try  to  preserve  the  shape  of  the  waveform.  Since 
Si(n)  and  S2(n)  are  the  output  and  the  input  of  the  first  PE,  the  phase  of  S2(n)  must  lead 
the  phase  of  Si(n).  In  Figure  4-2, 1 show  the  constructed  sequences  of  S2(n)  with  different 
Pj’s.  The  sequences  are  normalized  in  terms  of  their  amplitudes.  The  leading  phase  of 
S2(n)  becomes  larger  when  we  increase  the  value  of  Pj.  The  leading  phase  can  be  up  to  ti: 
radian.This  argument  can  be  verified  as  follows.  We  approximate  the  inverse  sigmoidal 
function  with  a linear  function  in  eq.  4. 1 . 1-4  and  obtain 

« («+l)-5,  in) 

p; 


eq.  4. 1.2-1 
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where  a is  a constant.  Let  us  assume  iS’j  (n)  = sin  (w«A/) , then  we  have 

Aj(«+1)  = sin  (m^wAZ  + wA/)  eq.  4. 1.2-2 

= cos  (vvAZ)  sin  (wa?AZ)  + sin  (vr  AZ)  cos  (m^a/AZ) 

= cos  (v|a)  sin  (waaAZ)  + sin  (\i/)  cos  (wa/AZ) 


where  v|a  = wAZ.  We  can  rewrite  eq.  4. 1.2-1  as 


a 


(aa+1)-5,  (aa) 


eq.  4. 1.2-3 


a 


= — [cos  (v|a)  sin  (wnAt)  + sin  (v|a)  cos  (waaAZ)]  - sin  (vtaaAz) 

PI 

= ysin  (waaAz  + <!)) 


where 


Y = 


a 


\ 


cos  (\\f)  - 1 


4- 


y 


a 


^Pi 


sin  (v|a) 


y 


O = cos 


-1 


^-^cos(\|/)  - 1^ 

p5 


V 
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This  approximation  shows  that  S2(n)  is  close  to  a sinewave  if  the  PE#1  works  like  a linear 
unit.  And,  we  now  have  an  approximation  formula  to  compute  the  leading  phase,  O.  In 
Figure  4-3, 1 gives  a plot  of  O v.s.  Pj,  given  a=l  and  cos  (vj/)  = 0.8.  It  shows  that  the 
leading  phase  of  S2(n)  to  S j(n)  will  increase  with  the  value  of  P j . 

The  right-hand  side  of  the  network  consists  of  one  PE(PE#2),  two  inter-neuron 
connections  (i.e.  W21  and  W|2),  and  a self  recurrent  connection  W22-  Suppose  the  input  of 
the  PE#1  leads  the  output  in  phase  by  (])  radian.  This  is  illustrated  in  Figure  4-4.  Since  the 
phase  reversal  of  a sinewave  will  not  change  the  shape  of  the  waveform,  we  may  expect 
that  after  a successful  traitung,  either  w j2  or  W2j,  but  not  both,  must  have  a negative  sign 
such  that  the  input  of  PE#  2 can  lead  its  output  in  phase.  Therefore,  the  adaptation  of  PE#2 
and  W22  is  to  produce  the  required  phase  shift  of  zt  - <}) . 
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Figure  4-2.  Resulting  waveform  for  S2(n)  with  different  selections  of 
pj’s.  s^:pi=0.5,  s\  :pi=1.0,  5^:pi=2.0 ,5^:pi=5.0 


Figure  4-3.  Approximation  relationship  between  Pj  and  the 
leading  phase  O. 


To  ease  the  adaptation  of  the  right-hand  side  of  the  network,  we  should  choose  a 
large  Pj.  However,  since  after  training  we  need  to  connect  both  parts  of  the  network  to 
work  together,  the  sensitivity  of  the  output  of  each  PE  to  its  input  has  to  be  taken  into 
account.  The  sensitivity  of  the  PE#1  is  computed  by 

S,{n)  = p.o',  («c/. («))  eq.  4.1.2-4 

where  o',  {net.  (n)  ) is  the  derivative  of  the  activation.  This  result  shows  that  a large  P| 
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can  make  the  activation  of  the  PE#1  become  very  sensitive  to  the  input  noise.  This  conclu- 
sion can  also  be  applied  to  the  PE#2.  Therefore,  we  need  to  keep  both  Pj  and  P2  small. 
Between  these  two  contradictory  requirements,  the  best  choice  of  Pj  is  the  one  that  yields 
a phase  shift  (j)  approximately  equal  to  n/2.  This  value  can  be  determined  experimentally 
or  analytically  (from  the  linear  approximation  of  eq.  4. 1.1-4,  or  eq.  4. 1.2-3).  Once  we 
choose  Pj  equal  to  this  value,  we  can  expect  that  after  training  the  value  of  P2  should  be 
similar  to  P j because  PE#2  is  required  to  produce  the  same  phase  shift  in  its  input  and  out- 
put signals,  s 


Figure  4-4.  Phase  relationship  between  Sj(n)  and  S2(n). 


If  Sj(n)  has  too  many  samples  per  cycle,  we  may  not  be  able  to  adjust  P j to  obtain 
a phase  shift  between  S](n)  and  S2(n)  close  to  7t/2  radian.  The  adaptation  of  the  right- 
hand  side  of  the  network  will  result  in  a large  P2.  In  this  case,  PE#2  will  become  sensitive 
to  its  input  noise.  As  a result,  the  frequency  and  the  phase  of  the  network  output  will 
change  slightly  with  time.  This  may  explain  why  some  researchers  obtain  a quasiperiodic 
output  after  they  trained  a two-PE  network  to  produce  a sinewave  [Wil89][Tsu90].  This 
analysis  also  gives  us  an  upper  bound  in  selecting  the  sampling  frequency  of  a target  sig- 
nal. 


133 


To  summarize,  a successful  training  of  the  network  to  produce  a sinewave  should 
result  in  one  and  only  one  inhibitory  inter-neuron  connection.  When  we  select  Pj  such 
that  the  phase  shift  between  S}(n)  and  S2(n)  is  close  to  90  degrees,  the  adaptation  of  P2 
should  converge  to  a value  near  P j . The  sampling  frequency  of  the  target  signal  should  be 
selected  properly  to  relax  the  high  gain  requirement  and  to  avoid  frequency  shift  in  the 
output  of  the  network. 

4 13  Experimental  Results 

A 20-Hz  sinewave  was  sampled  at  220  Hz  and  used  as  the  target  signal  for  the  fol- 
lowing experiment.  The  signal  is  given  is  Figure  4. 1.3-1.  We  select  a hyperbolic-tangent 
function  as  the  activation  function  of  each  PE  and  set  Pj  equal  to  1.  The  sequence  S2(n)  is 
computed  accordingly  and  also  shown  in  Figure  4-5.  The  first  200  samples  of  Sj(n)  and 
S2(n)  are  used  for  training  purposes.  The  right-hand  side  of  the  network  was  trained  using 
the  RTRL  algorithm  without  teacher  forcing.  The  momentum  term  is  included  in  each 
recursive  update  formula.  Figure  4-6  shows  the  learning  curve  during  the  training.  We 
note  that  the  adaptation  converges  very  fast  (within  12  training  epochs).  In  Table  4-1,  the 
parameters  of  the  network  after  the  training  are  listed.  We  notice  that  there  is  only  one 
inhibitory  inter-neuron  connection,  and  p.  and  P2  are  almost  the  same. 

After  this  training,  the  network  weights  were  fixed  and  it  was  relaxed  from  the  ini- 
tial condition  (0,0).  The  input  and  the  output  of  the  first  PE  are  plotted  with  their  target 
signals  in  Figure  4-7.  The  results  show  that  the  network  can  successfully  approximate  a 
sinewave  of  the  desired  frequency  and  phase.  Since  we  designed  the  network  to  operate  in 
its  “pseudo”  linear  region,  we  can  validate  the  resulting  network  from  its  linear  approxi- 
mation model  as  shown  in  figure  4-8.  This  linear  model  has  poles  at  0.8937  ± 0.5932  i, 
which  are,  as  expected,  located  on  the  unit  circle  of  the  z-plane 
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Figure  4-5.  Waveforms  of  Si(given) 
and  S2  (computed). 


O' ^ 

0 5 10  15  20  25  30  35  40  45  5C 


Figure  4-6.  Learning  curve. 


Figure  4-7.  Outputs  of  neurons(solid  line) 
and  their  target  signals(dashed  line). 


Figure  4-8.  Linear  approximation  of 
the  network 


Table  4-1:  Parameters  of  the  resulting  network 


Preselected 

Adaptive 

parameters 

parameters 

Gain 

P2  = 0.9710 

Weights 

Wii  = 1 

W]2  = -0.4012 
W21  = 0.9053 
W22  “ 0.7874 

Bias 

0j  =0.1086 

02  = 0.0858 
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4 2 Modeling  Non-Chaotic  Time  Series 

Chapeau-Blondeau  and  Chauvet  showed  that  by  changing  one  of  the  parameters  a 
small  neural  network  can  exhibit  a wide  range  of  dynamical  behaviors  [Cha92],  However, 
based  upon  our  analysis  in  section  4.1  we  understand  that  the  sensitivity  of  a network  to 
noise  will  increase  with  the  gains  of  the  activation  functions.  If  a high  gain  is  required  for 
a network  to  reproduce  a prescribed  dynamics,  after  training  the  network  may  not  be  able 
to  sustain  the  dynamics  for  long  due  to  its  sensitivity  to  the  modeling  error.  And,  a high 
gain  is  usually  required  for  a small  network  to  exhibit  complicated  dynamics.  Therefore, 
the  range  of  dynamics  that  a small  network  can  be  trained  to  reproduce  is  very  limited.  To 
model  a signal  generated  by  a more  complicated  nonlinear  system,  we  usually  need  to  uti- 
lize a larger  network.  In  this  section,  I try  to  model  another  periodic  dynamics;  however, 
the  underlying  system  is  more  complicated  than  a single-frequency  oscillator.  The  net- 
works employed  to  model  the  dynamics  will  possess  more  than  2 nonlinear  PEs. 

The  test  signal  to  be  modeled  is  sin^^  (2ti/«A/)  with  / = 20  and  A/  = 1/ 400. 
The  waveform  of  the  signal  is  given  in  Figure  4-9.  The  signal  can  be  regarded  as  the  sam- 
pled output  of  the  following  dynamical  system 

(0  = kx  (0  cot  (2k/0  = fix  it) , 0 eq.  4.2-1 

at 


where  k is  a constant.  The  system  is  nonautonomous.  But,  since  the  vector  field  is  peri- 
odic, it  can  be  converted  to  an  autonomous  system[Par87].  This  can  be  done  by  appending 
an  extra  state  0 (/)  = 2%ft.  An  equivalent  autonomous  system  becomes 


di 


*«)  =/(*('),  ^ 


eq.  4.2-2 


^(0 

dt 


= 2ti/ 


This  underlying  system  is  highly  nonlinear.  In  Figure  4-10, 1 show  a reconstruction  of  the 
system  dynamics  with  construction  dimension  equal  to  3 and  x=l.  Although  it  is  topologi- 
cally equivalent  to  a limit  cycle,  it  contains  four  corners.  This  reconstructed  dynamics 
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reveals  the  difficulty  in  modeling  the  dynamics  using  a one-step  predictor.  Because  during 
training  most  of  the  time  a one-step  predictor  will  be  trained  to  predict  points  on  straight 
lines.  The  number  of  the  training  patterns  representing  the  mapping  at  those  four  comer 
are  almost  negligible.  After  training,  the  one-step  predictor  may  be  able  to  predict  points 
on  straight  lines  precisely,  but  perform  poorly  in  predicting  points  at  four  comers.  If  we 
iterate  the  predictor,  the  prediction  error  at  those  four  corners  will  be  accumulated  and 
amplified.  Consequently,  the  output  of  the  predictor  will  be  driven  away  from  the  original 
signal. 

In  dynamic  modeling,  the  approach  of  training  a one-step  predictor  can  be  consid- 
ered as  an  equation  error  scheme.  In  this  scheme,  we  are  trying  to  model  the  local  (one 
step  ahead)  dynamics  of  the  signal,  and  we  do  not  impose  any  constraint  on  the  iterative 
map  of  the  model  during  training.  The  question  is  “can  we  use  a precise  local  approxima- 
tion model  to  reproduce  the  global  dynamics  of  the  underlying  system?”.  The  following 
experiment  is  designed  and  conducted  to  answer  this  question. 


30 

Figure  4-9.  (a)  Waveform,  and  (b)  spectmm  for  sin 
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30 

Figure  4-10.  Reconstructed  trajectory  for  sin  In/nd^t  with  construc- 
tion dimension  equal  to  3 and  t equal  to  1 . 


The  procedure  of  this  experiment  is  basically  as  follows:  a TDNN  is  trained  as  a 
one-step  predictor  of  the  test  signal.  After  training,  we  freeze  the  parameters  of  the  net- 
work, and  the  output  of  the  network  is  connected  back  to  the  input  layer,  which  is  just  a 
delay  line.  Given  an  initial  condition,  we  can  verify  if  the  predictor  really  captures  the 
underlying  dynamics  by  comparing  the  iterates  of  the  model  with  the  original  signal.  If  the 
iterative  map  of  the  predictor  learns  the  global  dynamics,  the  output  sequence  of  the  net- 
work should  be  the  same  as  the  original  signal.  I propose  this  simple  test  to  validate  the 

dynamic  model  obtained  through  learning. 

The  TDNN  used  in  this  experiment  has  a one-hidden-layer  structure  with  8 inputs, 
15  hidden  PEs,  and  one  linear  output  unit.  I will  call  it  8-15-1  network  for  short.  A sche- 
matic diagram  of  the  network  training  is  illustrated  in  Figure  4-11.  The  signal  is  fed  into 
the  input  delay  line,  and  the  desired  output  is  the  data  sample  one-step  ahead.  The  signal 
has  200  data  samples.  In  each  training  epoch,  I present  the  whole  signal  to  the  network 
once.  A learning  curve  is  given  in  Figure  4-12.  We  notice  that  after  300  training  epochs  the 
final  mean  square  error  («10'^)  can  almost  be  ignored.  We  may  say  that  the  network 
already  learns  the  local  behavior  of  the  underlying  dynamics  very  well.  However,  after 
training  when  we  connect  the  output  of  the  network  back  to  the  input  delay  line,  the  iter- 
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ates  of  the  network  approximate  the  original  signal  very  well  only  for  the  first  40  samples. 
This  is  illustrated  in  Figure  4-13.  Obviously,  the  iterative  map  of  the  network  can  not  sus- 
tain the  desired  dynamical  behavior.  In  other  words,  a small  perturbation  introduced  by  the 
prediction  error  will  drive  the  network  dynamics  away  from  the  limit  cycle.  Accordingly, 
we  may  conclude  that  one-step  prediction  training  is  not  sufficient  to  model  the  underlying 
dynamics  of  the  signal. 


Figure  4-11.  Training  a TDNN  as  a one-step  predictor 


epoch# 


Figure  4-12  Learning  curves  for  the  training  of  both  networks. 
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Figure  4-13.  Iterates  of  the  resulting  one-step  predictor. 
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Next,  I train  a network  in  output  error  scheme.  I connect  the  output  of  the  network 
back  to  the  input  delay  line.  The  network  with  this  architecture  is  referred  to  as  TDNNGF 
(TDNN  with  global  feedback).  This  network  is  trained  to  generate  a desired  output 
sequence  by  iteration.  This  is  illustrated  in  Figure  4-14.  We  note  that  now  we  train  the  net- 
work in  the  same  way  as  we  use  it. 

Since  we  specify  each  sample  of  a desired  output  sequence,  we  need  to  use  a tra- 
jectory learning  algorithm  to  train  the  network.  In  this  experiment,  I choose  the  BPTT 
algorithm.  During  training,  I first  clock  in  8 data  samples  to  the  input  delay  line  as  an  ini- 
tial condition,  and  the  network  is  trained  to  iteratively  produce  the  next  10  data  samples. 
In  other  words,  each  training  pattern  consists  of  an  initial  condition  and  a desired  output 
sequence.  The  length  of  the  desired  sequences  is  chosen  equal  to  the  period  of  the  signal.  I 
prepare  one  training  pattern  every  4 data  samples  from  the  original  signal.  To  compare 
with  the  previous  result,  the  mean  square  error  in  one-step  prediction  is  plotted  as  the 
learning  curve,  which  is  also  given  in  Figure  4-12.  We  note  that  the  training  converges 
very  fast.  However,  some  oscillations  can  be  observed  in  the  learning  curve.  This  can  be 
interpreted  as  meaning  that  the  network  dynamics  have  switched  among  several  different 
possible  limit  cycles  before  it  converges  to  an  optimal  (or  suboptimal)  solution. 
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Figure  4-14.  Training  a TDNNGF  as  a dynamic  model 


After  training,  the  network  is  loaded  with  an  initial  condition  and  produces  a 
sequence  of  2000  samples.  In  Figure  4-15,  the  last  100  samples  of  the  output  sequence  are 
given.  We  note  that  the  shape  and  the  frequency  of  the  waveform  are  preserved.  In  Figure 
4-15,  1 also  show  the  spectrum  of  the  last  400  samples  of  the  output  sequence.  It  is  the 
same  as  the  spectrum  of  the  original  signal.  These  results  suggest  that  the  network  cap- 
tures the  dynamics  of  the  signal  at  least  in  the  time  span  of  2000  data  samples.  For  a com- 
plete model  validation,  we  need  to  iterate  the  map  of  the  network  an  infinite  number  of 
times  and  compare  the  output  with  the  original  signal.  But,  this  validation  is  not  possible 
in  practice.  Usually,  we  define  a long  time  span,  in  which  the  model  is  required  to  repre- 
sent the  dynamics  of  the  signal,  and  validate  the  model  only  in  this  time  span. 
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Figure  4-15.  (a)  output  of  the  TDNNGF  compared  with  the  original  signal, 
and  (b)  its  spectrum. 


4 3 Modeling  Chaotic  Dynamics 

4 3 1 Time  Series  with  Small  Positive  Lvapunov  Exponents 

Chaotic  time  series  possess  positive  Lyapunov  exponents.  As  a result,  even  though 
we  have  an  exact  model  we  are  still  not  able  to  reproduce  the  same  time  series.  Due  to  any 
negligible  error  in  specifying  the  initial  condition,  the  output  of  the  model  will  diverge 
from  the  signal  in  the  long  run.  Training  a model  to  reduce  this  divergence  will  not 
improve  the  accuracy  of  the  model,  but  instead  may  cause  instability  during  training. 
Therefore,  we  should  avoid  training  a model  with  long  sequences.  On  the  other  hand,  the 
training  sequences  should  be  long  enough  to  provide  the  global  information  of  the  under- 
lying dynamics.  In  this  research,  I choose  one  average  orbit  time  of  the  signal  dynamics  as 
the  lower  bound  for  the  length  of  the  training  sequences.  As  we  mentioned  in  section 
2.5. 2.2,  when  we  train  a model  with  two  sequences  starting  from  the  same  neighborhood, 
the  uncertainty  regions  around  these  two  sequences  will  overlap  after  a certain  number  of 
iterations.  When  this  overlap  occurs,  an  ambiguity  in  specifying  the  desired  mapping  may 
arise.  Using  eq.  2.5.2.2-3  and  2.5.2.2-4,  we  can  estimate  the  length  of  the  training 
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sequences  to  avoid  or  at  least  mitigate  this  problem.  If  the  positive  Lyapunov  exponents  of 
a signal  are  relatively  small,  this  length  estimate  usually  will  be  larger  than  the  average 
orbit  time  of  the  signal  dynamics.  Then,  we  can  choose  either  this  length  estimate  or  the 

average  orbit  time  as  the  length  of  the  training  sequences. 

In  this  experiment,  I use  the  mg30  signal  as  our  test  signal.  The  signal  is  obtained 
by  integrating  the  Mackey-Glass  equation  in  eq.  2.5. 1-2  with  the  4-th  order  Runge-Kutta 
method  at  a step  size  of  1 . Then,  the  signal  is  downsampled  by  6 and  normalized  to  the 
range  of  [-1,1]  for  training.  The  signal  and  its  spectrum  is  given  in  Figure  4-16. 


Figure  4-16.  (a)  Waveform,  and  (b)  spectrum  of  the  mg30  signal. 


To  estimate  the  correlation  dimension  of  the  signal,  I compute  the  CIM  curves  and 
their  slopes.  The  results  are  given  in  Figure  4-25.  The  construction  dimension  increases 
from  2 to  14  by  an  increment  of  2,  and  I select  x equal  to  3 according  to  the  mutual  infor- 
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mation  function  as  shown  in  Figure  4-17.  The  estimate  of  the  correlation  dimension  for  the 
mg30  signal  is  2.70  ±0.05.  And,  we  notice  that  the  slope  of  the  CIM  curves  saturates 
when  the  construction  dimension  is  over  6.  This  is  a lower  bound  estimate  of  the  model 
order.  Next,  I estimate  the  largest  Lyapunov  exponent  of  the  trajectory  by  using  the  algo- 
rithm proposed  by  Wolf  et  al.  The  following  parameters  are  chosen:  construction  dimen- 
sion=8,  SCALMN=  2%  of  the  diameter  estimate  of  the  signal  attractor  (e^  estimated  from 
the  CIM  curves  in  Figure  4-25),  SCALMX=10%  of  e^,  and  EVOLV=12.  The  result  is 
given  in  Figure  4-24.  The  estimate  of  the  exponent  is  0.0073  ± 0.0001  nats/sec.  This  esti- 
mate is  close  to  the  value  (0.0071  nats/sec)  reported  in  the  literature  [San85]. 


Figure  4-17.  Average  mutual  information  function  for  the  mg30  sig- 
nal. 

First,  I train  a TDNN  with  8-14-1  architecture  to  be  a one-step  predictor  of  the 
mg30  signal.  The  lower  bound  of  the  model  order  is  6,  but  I choose  8 for  a better  perfor- 
mance. The  training  signal  has  500  data  samples.  The  learning  rate  is  set  to  0.001.  I stop 
the  training  after  500  epochs.  The  learning  curve  is  given  in  Figure  4-18.  We  note  that  the 
change  of  the  mean  square  error  can  almost  be  ignored  after  first  200  epochs.  The  final 
mean  square  error  is  2.88  xl0~"*.  After  training,  I use  the  network  to  produce  3000  data 
samples  by  iteration.  The  waveform  and  the  spectrum  of  the  first  500  data  samples  are 
shown  in  Figure  4-19.  Compared  with  the  original  signal,  these  iterates  seem  much  more 
regular.. 
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Figure  4-18.  Learning  curves  for  TDNN  and  TDNNGF. 


Figure  4-19.  (a)  Waveform,  and  (b)  spectrum  of  the  iterates  of  the  TDNN. 


Next,  I train  a TDNNGF  of  the  same  size  to  model  the  dynamics.  I use  eq.  2. 5.2.2- 
3 to  estimate  the  / ’s  for  all  pairs  of  nearby  training  sequences.  The  result  is  given  in  Fig- 

s 
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ure  4-20.  The  average  of  these  / ’s  is  14,  and  this  average  is  chosen  as  the  length  of  the 
training  sequences.  Once  we  determine  the  length  of  the  training  sequences,  each  training 
pair,  including  an  initial  condition  and  a desired  output  sequence,  is  prepared  from  the  sig- 
nal every  3 points.  I use  the  BPTT  algorithm  to  train  the  network.  The  learning  curve  is 
also  given  in  Figure  4-18.  We  note  that  in  one-step  prediction  the  final  mean  square  error 
of  the  TDNN  (0.000288)  is  about  half  of  the  TDNNGF  (0.000648).  After  training,  the  net- 
work is  also  used  to  reproduce  a sequence  of  3000  data  samples.  The  waveform  and  the 
spectrum  of  the  first  500  points  are  shown  in  Figure  4-21.  The  output  waveform  of  the 
TDNNGF  is  very  close  to  that  of  the  mg30  signal.  We  also  note  that  an  accurate  prediction 
can  be  made  up  to  60  steps  ahead  with  this  initial  condition. 


Figure  4-20.  Estimates  of  /^’s. 
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Figure  4-21 . (a)  Waveform,  and  (b)  spectrum  of  the  iterates  of  the  TDNNGF. 


To  compare  the  performance  of  multi-step  prediction  for  these  two  networks,  I 
compute  the  average  prediction  error  for  each  iteration.  The  error  curves  are  shown  in  Fig- 


ure 4-22. 1 also  plot  Casdagli’s  conjecture  curve,  which  is  calculated  by 

7 X iAt  ^ 

(d.)  ■ = (c„«  ■"  ) 


eq.  4.3. 1-1 


with  = 7o. 00064 8,  X = 0.0073,  and  At  = 6.  We  note  that  the  error  curve  of  the 
network  with  global  feedback  is  very  close  to  Casdagli’s  conjecture  curve.  This  corrobo- 
rates the  fact  that  the  TDNNGF  is  a better  dynamic  model  for  the  mg30  signal. 

Next,  we  validate  the  models  according  to  the  reconstructed  dynamics  of  their  out- 
puts. I reconstruct  three  trajectories  from  the  mg30  signal  and  the  outputs  of  both  models 
in  3-dimensional  space  using  the  delay  coordinate  method  (x=3),  and  plot  their  largest  2- 
dimension  projections  in  Figure  4-23.  The  reconstructed  trajectory  from  the  output  of  the 
TDNNGF  is  very  similar  to  the  original  signal  trajectory.  To  further  quantify  this  similar- 
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ity,  I measure  the  dynamics  invariants  of  these  three  trajectories.  The  estimates  of  the  larg- 
est Lyapunov  exponents  for  the  output  of  the  TDNN  is  0.0063  ± 0.0002,  while  that  for  the 
output  the  TDNNGF  is  0.0074  ± 0.0001 . These  results  are  shown  in  Figure  4-24  with  the 
estimate  of  the  mg30  signal.  Again,  we  find  that  the  result  for  the  output  of  the  TDNNGF 
is  very  close  to  that  of  the  mg30  signal.  In  Figure  4-25,  we  estimate  the  correlation  dimen- 
sions for  the  outputs  of  both  networks  and  the  mg30  signal.  The  correlation  dimension 
estimate  is  2.70  ± 0.05  for  the  mg30  signal,  2.65  + 0.03  for  the  output  of  the  TDNNGF, 
and  1.60  + 0.10  for  the  output  of  the  TDNN.  Again,  these  results  show  that  the  output  of 
the  TDNNGF  can  preserve  the  characteristics  of  the  original  signal  very  well.  The  dimen- 
sion estimate  of  the  output  of  the  TDNN  is  much  smaller  than  the  other  two  estimates. 
This  implies  that  the  dynamics  of  the  TDNN  is  less  complicated  than  the  original  system. 


Figure  4-22.  Mean  square  errors  in  multi-step  prediction  (by  iter- 
ation) of  both  networks. 
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Figure  4-23.  2-D  projections  of  the  reconstruction  trajectories  from  (a)  the  mg30 
signal,  (b)  the  output  of  the  TDNNGF,  and  (c)  the  output  of  the  TDNN. 


From  all  of  these  comparisons,  we  may  conclude  that  the  TDNNFG  captures  the 
underlying  dynamics  of  the  mg30  signal  at  least  for  the  time  span  of  3000  data  samples. 
Although  the  TDNN  performs  better  than  the  TDNNGF  in  one-step  prediction,  the  rest  of 
the  tests  shows  that  the  TDNN  can  not  be  used  as  a dynamic  model  of  the  mg30  signal. 
Therefore,  an  accurate  one-step  predictor  does  not  guarantee  accurate  dynamic  modeling. 

In  the  previous  experiment,  I chose  the  length  of  the  training  sequences  according 
to  my  proposed  method.  If  we  prepare  longer  training  sequences,  we  may  observe  some 
oscillations  in  the  learning  curve.  The  following  experiment  is  designed  to  illustrate  this 
problem. 

In  this  experiment,  I train  a TDNNGF  of  the  same  size  to  model  the  signal.  But  the 
length  of  training  sequences  is  increased  from  14  samples  to  20  samples.  In  Figure  4-26, 
we  note  that  the  performance  of  the  network  in  one-step  prediction  is  very  poor,  and  the 
error  can  not  be  reduced  further  as  the  training  proceeded.  After  training,  3000  samples 
are  generated  by  iteration.  The  waveform  and  the  spectrum  for  the  first  500  iterates  are 
shown  in  Figure  4-27.  From  the  spectrum,  we  notice  that  the  output  becomes  a quasi-peri- 
odic  signal. 
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This  experiment  shows  that  without  any  method  to  estimate  a proper  length  to  pre- 
pare the  training  sequences,  we  may  waste  a lot  of  time  in  network  training,  and  fail  to 
capture  accurately  the  dynamics  of  the  underlying  system. 


nats/sec 


Figure  4-24.  Run-time  Lyapunov  exponent  estimations 
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Figure  4-25.  CIM  curves  and  their  slope  estimations  for  (a)(b)  the  mg30  signal, 
(c)(d)  the  output  of  the  TDNNGF,  and  (e)(f)  the  output  of  the  TDNN. 


Figure  4-26.  Learning  curve  for  long  training  sequences. 
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Figure  4-27.  (a)  Waveform  and  (b)  spectrum  for  the  output  of  the  TDNNGF  trained 
with  long  sequences. 


4 3.2  Time  Series  with  Large  Positive  Lvapunov  Exponents 

Some  chaotic  signals  can  have  very  large  positive  Lyapunov  exponents  such  that 
the  estimation  of  the  length  of  training  sequences  using  eq.  2. 5. 2. 2-3  may  yield  just  a few 
data  samples.  The  training  sequences  of  a short  length  can  hardly  represent  the  global 
information  about  the  underlying  dynamics.  In  section  2. 5. 2. 3,  I proposed  a two-stage 
training  procedure  for  this  case.  In  the  first  stage,  we  train  a model  as  a one-step  predictor 
of  a given  signal  using  the  equation  error  scheme.  In  this  phase,  the  model  will  approxi- 
mate the  local  behavior  of  the  signal.  When  the  average  one-step  prediction  error  of  the 
model  is  smaller  than  an  error  threshold,  we  switch  the  training  to  the  second  phase.  In  the 
second  stage,  we  train  the  model  using  the  proposed  selectable-target-sequence  training 
method.  In  this  method,  we  load  an  initial  condition  into  the  model,  and  iterate  the  model 
to  produce  a sequence  of  certain  length.  Then,  we  compare  the  sequence  with  several  pos- 
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sible  target  sequences  whose  initial  conditions  are  located  in  the  same  neighborhood,  and 
select  the  one  closest  to  the  model  output  as  the  target  sequence  to  compute  the  error.  In 
this  section,  this  training  procedure  will  be  implemented  to  model  a chaotic  signal  with 
large  Lyapunov  exponents. 

The  test  signal  used  in  the  following  experiments  is  the  Lorenz  signal  (with  o = 10, 
r = 28,  b = 8/3).  The  sampling  rate  of  the  signal  is  set  to  10  Hz  to  avoid  the  requirement  of 
a large-size  neural  network.  The  signal  and  its  spectrum  are  given  in  Figure  4-28.  The 
CIM  curves  and  their  slope  estimation  is  shown  in  Figure  4-29.  The  construction  dimen- 
sion is  increased  from  2 to  14  with  an  increment  of  2,  and  I choose  x equal  to  1 according 
to  the  estimate  of  the  average  mutual  information  function.  The  estimate  of  the  correlation 
dimension  is  2.04  ±0.01.  The  Largest  Lyapunov  exponent  is  computed  using  the  algo- 
rithm proposed  by  Wolf  et  al.  The  following  parameters  are  chosen;  SCALMN  = 2%  of 
the  diameter  estimate  of  the  signal  attractor,  SCALMX  = 10%  of  the  attractor  diameter 
estimate,  EVOLV  = 10,  construction  dimension  = 6.  The  result  is  shown  in  Figure  4-30. 
This  estimate  is  2.19  ±0.02  bits/sec.  We  notice  that  the  positive  Lyapunov  exponent  of 
the  Lorenz  signal  is  very  large. 

Again,  we  first  train  a TDNN  as  a one-step  predictor  of  the  signal.  The  architecture 
of  the  TDNN  is  8-16-1.  The  output  unit  is  linear.  The  signal  used  in  training  has  500  data 
samples.  The  learning  rate  is  set  to  0.01,  and  the  training  is  stopped  after  5000  epochs.  The 
learning  curve  is  given  in  Figure  4-31.  We  note  that  the  learning  curve  first  goes  down, 
and  then  bounces  back  before  it  converges.  This  may  be  interpretered  as  the  existence  of 
some  contradictory  requirements  in  the  desired  mapping.  In  other  words,  because  the  pos- 
itive Lyapunov  exponent  of  the  signal  is  so  large,  points  in  the  same  neighborhood  may 
have  next  points  scattered  around  the  attractor,  which  causes  the  problem  for  the  one-step 
prediction.  The  final  mean  square  error  is  5.03  x lO”"*. 
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Figure  4-28.  (a)  Waveform,  and  (b)  spectrum  of  the  Lorenz  signal. 


Figure  4-29.  (a)  CIM  curves,  and  (b)  their  slope  estimates  for 
the  Lorenz  signal. 
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Figure  4-30.  Run-time  maximal  Lyapunov  exponent  estimate 
for  the  Lorenz  signal. 

After  training,  I iterate  the  resulting  network  to  produce  a sequence  of  500  data 
samples.  The  sequence  is  shown  in  Figure  4-32.  The  waveform  does  not  look  even  close 
to  the  original  signal  at  all.  By  this  visual  inspection,  we  already  can  conclude  that  the 
TDNN  does  not  learn  the  dynamics  of  the  signal.  We  can  plug  the  final  mean  square  error 
obtained  in  this  training  into  eq.  2. 5.2. 3-3  to  estimate  /^’s.  The  estimation  results  are 
shown  in  Figure  4-33.  The  average  of  these  estimates  is  just  0.34.  Therefore,  we  are  not 
able  to  train  a network  using  normal  sequence  training  method. 


Figure  4-3 1 . Learning  curve  for  the  TDNN. 
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Figure  4-32.  Iterates  of  the  TDNN  after  training. 


Figure  4-33.  Estimation  of  i^’s  for  the  Lorenz  signal. 


Next,  I train  a TDNNGF  of  the  same  size  to  model  the  signal  using  my  proposed  2- 
stage  training  procedure.  In  the  first  phase,  the  network  is  trained  as  a one-step  predictor. 
In  other  words,  the  feedback  connection  is  taken  out,  and  the  signal  is  directly  clocked  into 

_3 

the  network  input  layer.  After  the  mean  square  error  becomes  smaller  than  3.0  x 10  , 1 
start  using  the  selectable-target  sequence  training  method  to  train  the  network.  The  length 
of  the  training  sequences  equal  to  8 is  chosen.  The  learning  rate  is  set  to  0.01,  and  the 
BPTT  algorithm  is  employed  to  update  the  network  parameters.  I stop  the  training  after 
5000  epochs.  The  learning  curve  is  plotted  according  to  the  performance  of  the  network  in 
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one-step  prediction,  and  is  shown  in  Figure  4-34.  We  note  that  after  we  switch  to  the 
selectable-target-sequence  training  oscillations  can  be  observed  from  time  to  time.  How- 
ever, the  error  has  a tendency  to  decrease,  and  the  amplitude  of  the  oscillations  becomes 
smaller  and  smaller. 


After  training,  given  an  initial  condition,  a sequence  of  3000  data  samples  was  pro- 
duced by  the  network  by  iteration.  The  waveform  and  the  spectrum  of  the  first  1000  sam- 
ples are  shown  in  Figure  4-35.  For  comparison,  a sequence  of  the  Lorenz  signal  starting 
from  the  same  initial  condition  is  also  given  in  Figure  4-35.  Although  the  amplitude  of  the 
network  output  does  not  match  that  of  the  original  signal  perfectly,  we  notice  that  most  of 
the  patterns  contained  in  the  original  signal  can  also  be  found  in  the  network  output.  But 
their  sequential  order  are  rearranged  in  the  waveform  of  the  network  output.  We  also  note 
that  the  spectrum  of  the  network  output  is  similar  to  the  spectrum  of  the  original  signal. 
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Figure  4-35.  (a)  Waveform  of  the  TDNNGF  output,  compared  with  (b) 
the  original  signal,  and  (c)  spectrum  of  the  output. 


In  Figure  4-36, 1 show  2-D  projections  of  the  trajectories  reconstructed  from  the 
model  output  and  the  original  signal.  They  look  similar,  but  we  can  observe  some  over- 
shoots in  the  trajectory  of  the  model  output.  We  further  test  the  model  according  to  the 
measurements  of  the  dynamical  invariants.  First,  we  compute  the  CIM  curves  and  esti- 
mate the  correlation  dimension.  The  result  is  shown  in  Figure  4-37.  The  dimension  esti- 
mate from  the  network  output  is  2.22  + 0.02.  This  estimate  is  higher  than  that  of  the 
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original  signal.  A possible  explanation  is  that  the  modeling  error  may  introduce  some 
high-dimensional  noise  into  the  network  dynamics.  Next,  we  estimate  the  largest 
Lyapunov  exponent.  The  result  is  shown  in  Figure  4-38.  The  estimate  of  the  largest 
Lyapunov  exponent  is  2.20  ± 0.02  bits/sec.  This  estimate  is  very  close  to  that  of  the  origi- 
nal signal. 


Figure  4-36.  2-D  projections  of  the  reconstructed  trajectories 
for  (a)  original  signal,  and  (b)  the  model  output. 


Based  upon  these  test  results,  we  may  say  that  the  resulting  network  has  learned 
the  dynamics  of  the  Lorenz  signal  to  some  extent.  Although  the  network  dynamics  seem  a 
little  more  complicated  than  the  underlying  dynamics  of  the  signal,  some  important  char- 
acteristics of  the  signal  are  modeled  very  well.  From  this  point  of  view,  the  proposed  two- 
stage  training  method  is  superior  to  a one-step-prediction  training.  Therefore,  this  method 
can  be  regarded  as  a remedy  when  a normal  sequence  training  can  not  be  applied  due  to 
the  instability  problem. 

In  this  experiment,  the  selections  of  the  error  threshold  and  the  length  of  the  train- 
ing sequences  are  empirical.  If  a large  error  threshold  is  selected,  the  network  may  not  be 
able  to  learn  the  local  behaviors  of  the  signal.  Then,  the  second-stage  training  will  become 
similar  to  a normal  sequence  training.  The  training  may  never  converge.  On  the  other 


159 


hand,  if  we  choose  a small  error  threshold,  we  may  observe  some  oscillations  with  very 
high  amplitude.  The  training  may  diverge  due  to  these  oscillations.  As  to  the  selection  of 
the  length  of  the  training  sequence,  it  seems  that  there  exists  an  upper  bound.  When  the 
training  sequence  is  longer  than  the  value,  the  training  will  never  converge.  At  this  time,  a 
systematic  approach  to  decide  these  parameters  is  still  under  investigation. 
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Figure  4-37.  (a)  CIM  curves,  and  (b)  their  slope  estimation  for 
the  TDNNGF  output. 
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Figure  4-38.  Run-time  Lyapunov  exponent  estimation  for  the  TDNNGF  output. 
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4.4  Noise  Reduction 

In  section  2.4.1,  we  reviewed  the  method  proposed  by  Landa  and  Rosenblum  to 
remove  high-dimensional  noise  from  time  series.  In  this  method,  we  first  reconstruct  a 
state-space  trajectory  from  a noise-contaminated  signal,  determine  the  basis  vectors  and 
the  dimension  of  the  signal  embedding  space,  project  the  trajectory  onto  this  subspace, 
and  recover  a time  series  from  the  projected  trajectory.  In  this  manner,  high-dimensional 
noise  can  be  removed  from  the  signal.  This  can  be  regarded  as  a filtering  process  in  state 
space.  The  determination  of  the  signal  embedding  subspace  in  this  state-space  filtering  is 
analogous  to  the  selection  of  a cutoff  frequency  in  frequency-domain  filtering.  As  noise 
may  exist  in  the  passbands  of  a frequency  filter,  the  residue  noise  still  corrupts  the  projec- 
tion of  the  trajectory  in  the  embedding  subspace.  The  noise  effect  will  contribute  to  the 
jaggedness  of  the  trajectory.  In  this  section,  I propose  to  use  an  artificial  neural  network, 
trained  as  a trajectory  predictor,  to  further  remove  the  noise  in  the  embedding  subspace. 

This  approach  is  similar  to  the  work  of  Kostelich  and  Yorke,  which  was  also  dis- 
cussed in  section  2.4.1.  They  proposed  to  use  linear  models  to  approximate  the  local 
behaviors  of  a reconstructed  trajectory.  Then,  these  local  models  are  employed  to  smooth 
the  jagged  trajectory,  and  noise  can  therefore  be  reduced  further.  For  a discrete-time  case, 
eq.  2.4. 1-1  can  be  rewritten  as 

x(t)  = ^Jx(t-\)  +b  eq.  4.4-1 

In  fact,  this  model  is  a one-step  predictor  of  the  local  state  vectors.  The  parameters  of  this 
local  model  in  a small  neighborhood  can  be  determined  by  finding  the  best  fit  to  the  points 
in  the  region.  A linear  model  can  approximate  the  dynamics  very  well  only  in  a small  area. 
To  obtain  enough  points  for  an  accurate  parameter  estimation  for  this  local  linear  model 
usually  requires  a long  trajectory  which  will  travel  the  region  frequently  enough.  In  other 
words,  a long  data  recording  is  required  in  the  reconstruction.  In  addition,  if  the  system  is 
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highly  nonlinear,  the  region  where  this  linearization  holds  will  shrink.  In  this  case,  we  may 
need  to  partition  the  construction  space  into  more  smaller  regions. 

Since  the  artificial  neural  network  is  a powerful  nonlinear  modeling  tool,  I propose 
to  train  a neural  network  as  a one-step  predictor  of  the  projection  of  a reconstructed  trajec- 
tory. Training  a network  as  a one-step  predictor,  we  are  also  modeling  the  local  dynamics 
of  a trajectory.  But,  in  this  approach  we  just  use  a single  model.  Moreover,  since  the  model 
is  nonlinear  it  can  approximate  a nonlinear  dynamics  better  than  a linear  model.  However, 
this  nonlinear  mapping  capability  also  brings  another  concern.  Let  us  examine  two  recon- 
structed  state  vectors,  y{t)  and  1),  that  are  consecutive  in  time.  They  can  be 
expresses  as 


and 

with 


T(0  - x{t)  x(/+l)  ...  x{t  + 2m 


>’(/+l)  = x(/+l)  x(t  + 2)  ...  x{t  + 2m  + 1) 


x(t)  = d(t)  +n(t) 


eq.  4.4-2 


where  d(t)  is  the  noise-free  data  sample  at  time  t,  and  n(t)  is  noise.  Here,  I assume  x equal 
1 in  the  reconstruction.  We  note  that  the  noise  components  in  y (t)  and  y{t+  1)  are  no 
longer  uncorrelated,  or  the  noise  components  in  y (t)  can  now  be  used  to  estimate  to  some 
extent  the  noise  components  in  + 1 ) . If  we  choose  different  values  of  x,  we  will  obtain 
different  correlation  relationship  among  the  noise  components  in  these  close-in-time  state 
vectors.  The  correlation  of  the  noise  components  in  the  projection  of  a reconstructed  tra- 
jectory can  also  be  expected.  A powerful  nonlinear  model  like  an  artificial  network  may 
develop  a mapping  that  combines  the  dynamics  of  the  signal  with  the  noise  during  train- 
ing. 

A fundamental  hypothesis  of  my  proposed  method  is  that  the  neural  network  will 
learn  first  the  signal  trajectory  before  it  starts  learning  the  noise  that  corresponds  to  jag- 
gedness superimposed  on  the  signal  trajectory.  The  training  of  the  network  has  to  be 
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stopped  at  the  point  that  most  of  the  system  dynamics  have  been  learned  and  before  the 
training  leads  to  an  incorporation  of  the  noise  in  the  model. 

Since  the  memory  depth  of  the  gamma  memory  required  for  one-step  prediction 
will  be  different  for  the  signal  dynamics  and  the  noise  dynamics,  I proposed  to  use  the 
property  of  the  adaptive  memory  depth  of  gamma  networks  to  create  a stop  criterion  for 
the  training.  The  memory  depth  for  a gamma  memory  is  estimated  as  K/ p.  if  p < 1 . When 
p = 1 , the  memory  depth  is  the  shortest  (equal  to  the  order  of  the  memory),  and  when  we 
decrease  the  value  of  p,  the  depth  of  the  memory  will  progressively  increase.  In  the  case  of 
1 < p < 2 , the  memory  depth  is  defined  as  K/  (2  - p) . If  we  increase  the  value  of  p,  the 
memory  depth  will  increase  accordingly.  However,  a gamma  memory  acts  as  a low-pass 
filter  when  p < 1 and  as  a high-pass  filter  when  1 < p < 2.  The  topology  of  the  network 
that  I used  in  this  approach  is  shown  in  Figure  4-39.  Each  input  unit  of  the  network  is 
equipped  with  a gamma  memory.  Therefore,  this  network  can  be  viewed  as  a multi-chan- 
nel focused  gamma  network  [Pri92a].  There  is  one  input  unit  for  each  coordinate  of  a state 
vector.  The  network  is  trained  to  predict  the  next  state  vector.  During  training,  we  adjust 
both  connection  weights  and  gamma  memory  parameters  to  minimize  the  prediction  error. 
If  the  hypothesis  is  true,  at  the  very  beginning  of  the  training  the  gamma  memory  parame- 
ters, p’s,  should  be  adapted  to  some  values  that  reflect  the  required  memory  depth  for 
modeling  signal  dynamics.  When  the  network  starts  learning  noise  dynamics  we  should  be 
able  to  observe  the  change  of  these  parameters.  Thus,  inspection  in  the  learning  curve  of 
the  p’s  can  serve  as  an  indicator  about  when  to  stop  training. 

To  test  this  hypothesis,  I used  the  same  test  signals  given  in  section  2.4.1.  I pre- 
pared the  training  patterns  by  using  the  embedding-space  projections  of  the  trajectories 
(i.e.  U ).  This  selection  is  based  on  the  following  reasons.  First,  the  projection  contains 
less  noise.  Second,  the  required  number  of  input  units  can  be  minimized.  And  third,  the 
coordinates  of  each  projected  state  vector,  which  are  the  projections  on  the  principal  axes, 
are  uncorrelated.  All  of  these  will  expedite  the  training  a lot  [Hon84].  In  the  topology  of  a 
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network,  the  number  of  the  hidden  PEs  is  associated  with  the  number  of  degrees  of  free- 
dom to  “discover”  the  mapping  during  training.  Since  we  want  to  suppress  the  noise  from 
the  learning  dynamics,  the  number  of  hidden  PEs  should  be  small,  but  large  enough  to 
approximate  well  the  input-output  map  characterizing  the  dynamical  system.  As  a rule  of 
thumb,  the  number  of  hidden  PEs  can  be  set  comparable  to  the  number  of  the  network 
input  units.  The  order  of  each  gamma  memory  (K)  is  fixed  at  two.  Because  from  the 
experiments  I found  that  the  change  in  p’s  is  more  noticeable  for  a smaller  memory  order. 
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Figure  4-39.  Multi-channel  focused  gamma  network  for  noise  reduction 


After  training,  we  clock  in  the  projection  trajectory  to  obtain  a new  trajectory, 
which  will  be  smoother  if  the  training  is  successful.  Then,  we  can  use  eq.  2.4. 1-5  to 
recover  a time  series. 

The  topology  of  the  network  for  the  learning  of  the  van  der  Pol  attractor  is:  2 input 
units,  10  hidden  PEs,  and  2 linear  output  units,  and  that  for  the  Lorenz  attractor  is:  4 input 
units,  20  hidden  PEs,  and  4 linear  output  units.  The  connection  weights  of  both  networks 
are  trained  using  the  backpropagation  algorithm,  and  the  p’s  are  trained  by  using  the 
RTRL  algorithm.  I prepared  1200  training  patterns  and  500  test  patterns  for  both  attrac- 
tors. Since  the  noise-free  signals  are  known,  we  are  able  to  compute  the  signal-to-noise 
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ratio  (SNR)  at  the  end  of  each  training  epoch  (1200  training  patterns/epoch).  In  Figure  4- 
40,  I show  the  learning  curve  of  the  network,  the  adaptation  curve  of  p’s,  and  the  SNR 
curve  for  the  training  of  the  van  der  Pol  attractor.  And,  these  curves  for  the  training  of  the 
Lorenz  attractor  are  given  in  Figure  4-41 . 

Several  important  observations  can  be  extracted  from  these  plots.  First,  our 
hypothesis  that  the  signal  trajectories  are  first  learned  is  verified  in  the  SNR  curves,  since 
these  curves  increase  and  then  decrease  as  the  training  proceeds.  This  means  that  noise 
learning  is  concentrated  in  the  later  portion  of  the  training.  Second,  it  is  obvious  that  the 
learning  curves  of  both  networks  do  not  contain  any  direct  information  to  stop  training  for 
the  best  performance.  Third,  we  note  that  during  the  first  several  training  epochs  the  p’s 
will  reach  some  stable  values.  At  the  same  time,  the  SNR  curves  also  have  their  peaks. 
When  the  p’s  start  being  driven  away  from  these  values,  the  SNR  curves  drop.  These 
results  suggest  that  the  change  of  the  p’s  during  training  can  really  be  used  as  an  indicator 
to  stop  training. 

As  indicated  in  Figure  4-40  and  Figure  4-41,  I stop  the  training  for  the  prediction 
of  the  van  der  Pol  attractor  after  7 training  epochs  and  for  the  prediction  of  the  Lorenz 
attractor  after  5 epochs.  The  recovered  time  series  and  their  spectra  are  given  in  Figure  4- 
42.  We  note  that  the  SNR  of  the  van  der  Pol  signal  is  improved  to  19db,  while  the  SNR  of 
the  Lorenz  signal  is  raised  to  16.2  db.  We  also  notice  that  the  spectra  of  both  recovered 
time  series  follow  the  spectra  of  the  original  signals  very  well  even  in  the  high-frequency 
band.  For  comparison,  I directly  filtered  the  Lorenz  signal  using  a 26-th  order  low-pass 
FIR  filter  with  cutoff  frequency  at  60  in  the  scale  of  the  plot.  The  resulting  signal  and  its 
spectrum  are  also  given  in  Figure  4-42.  Although  the  SNR  of  the  signal  can  be  raised  to 
15.11  db.  We  notice  that  the  spectrum  of  the  resulting  signal  is  severely  distorted  in  the 
high-frequency  band. 

There  are  some  cases  in  which  the  adaptation  curves  of  p’s  do  not  change  at  the 
same  time  after  they  reached  some  stable  values  at  the  beginning.  In  other  words,  the  noise 
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projections  on  different  principal  axes  may  be  learned  at  different  time.  In  these  cases,  we 
could  use  different  learning  rates  for  the  adaptations  of  p’s. 

At  this  time,  we  still  do  not  have  an  explicit  stop  criterion  for  network  training. 
However,  since  we  propose  to  stop  training  when  the  learning  curves  of  p’s  first  reach  sat- 
uration values,  a possible  criteron  is  to  determine  when  the  slopes  of  these  curves  become 
zero. 

To  summarize,  this  state-space  noise  reduction  procedure  includes  trajectory 
reconstruction,  embedding  space  projection,  training  of  a state  predictor,  and  time  series 
recovery.  In  this  approach,  the  knowledge  about  the  signal  spectrum  is  not  required.  All  of 
the  design  parameters  can  be  estimated.  The  results  show  that  this  method  can  really 
reduce  noise  and  still  preserve  very  well  the  spectrum  of  the  original  signal.  However,  we 
only  apply  this  method  to  the  signals  generated  by  low-dimensional  chaotic  systems. 
When  a system  is  more  complicated,  it  may  become  extremely  difficult  to  distinguish  the 
output  of  the  system  from  noise.  Consequently,  neither  this  method  nor  a linear  filter  can 
remove  noise  effectively. 
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Figure  4-40.  (a)  Learning  curves  (dark  line:  training  set,  gray  line:  test  set), 
(b)  adaptation  curves  of  p’s,  and  (c)  SNR  curve  for  the  training  of  the  van 
der  Pol  attractor.  In  (b),  the  number  corresponding  to  the  order  of  the  princi- 
pal axes. 
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Figure  4-41.  (a)  Learning  curves  (dark  line;  training  set,  gray  line:  test 
set),  (b)  adaptation  curves  of  p’s,  and  (c)  SNR  curve  for  the  training  of 
the  Lorenz  attractor.  In  (b),  the  number  corresponding  to  the  order  of  the 
principal  axes. 
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Figure  4-42.  Recovered  signals  and  their  spectra  (a)(b)  for  the  van  der  Pol 
signal,  (c)(d)  for  the  Lorenz  signal.  (e)(f)  are  the  low-pass  filtered  results. 


CHAPTER  5 

CONCLUSIONS  AND  FUTURE  DIRECTIONS 


5 1 Conclusions 

The  intensive  study  of  chaotic  dynamics  started  in  the  last  decade.  And,  the 
research  in  chaotic  time  series  modeling  is  still  in  its  infancy.  Since  the  underlying  system 
of  a chaotic  time  series  is  nonlinear,  conventional  system  identification  techniques  are  not 
applicable.  In  this  research,  a methodology  is  established  to  model  the  underlying  nonlin- 
ear dynamics  from  a signal.  Due  to  its  powerful  function  approximation  capability,  artifi- 
cial neural  networks  are  chosen  as  the  modeling  tool. 

In  conventional  signal  modeling,  a one-step  predictor  is  built  from  a signal  and  the 
model  is  treated  as  an  inverse  of  the  system  that  generated  the  signal.  This  approach  has 
also  been  adopted  by  most  research  groups  in  modeling  chaotic  time  series.  Based  on  anal- 
ysis and  experimental  results,  we  conclude  that  training  a model  as  a one-step  predictor  is 
not  sufficient  to  capture  the  underlying  nonlinear  dynamics  of  a signal.  Trajectory  learn- 
ing, in  which  the  iterative  map  of  a predictive  model  is  constrained,  is  shown  to  able  to 
yield  a better  dynamic  model. 

In  this  methodology,  I propose  to  use  TDNNs  with  global  feedback  (TDNNGFs)  as 
the  model  structure.  I train  the  model  using  a trajectory  learning  algorithm  to  enhance  the 
fitness  of  its  iterative  map  to  the  original  system  dynamics.  Some  design  parameters  are 
determined  according  to  the  measurements  of  dynamical  invariants.  I propose  to  estimate 
a lower  bound  of  the  model  order  by  inspecting  the  saturation  of  the  CIM  (Correlation 
Integral  Map)  curves  computed  from  the  signal.  For  chaotic  time  series,  the  determination 
of  the  length  of  training  sequences  is  crucial  because  long  sequences  may  cause  instability 
during  training,  and  short  sequences  can  not  provide  global  information  of  the  underlying 
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attractor.  According  to  the  largest  Lyapunov  exponent  of  a chaotic  signal,  I suggest  a crite- 
rion to  estimate  the  maximal  length  of  training  sequences  to  avoid  or  at  least  mitigate 
instability  during  model  training.  If  the  Lyapunov  exponent  of  a chaotic  signal  is  too  large, 
this  estimate  may  become  very  small.  To  be  able  to  train  a model  to  learn  trajectories  for 
this  type  of  chaotic  signals,  I propose  a selectable-target-sequence  training  method.  In  this 
method,  the  desired  output  sequence  is  designated  according  to  the  dynamical  behavior  of 
the  signal  in  the  same  neighborhood  and  the  iterate  of  the  model.  Experimental  results 
show  that  the  proposed  criterion  is  a good  upper  bound  estimate  for  the  length  of  training 
sequences,  and  the  selectable-target-sequence  training  method  can  yield  a better  dynamic 
model  without  causing  severe  oscillations  during  training.  Using  this  method,  we  show 
promising  results  for  modeling  the  Lorenz  system,  which  has  not  been  successfully  mod- 
eled before  through  prediction. 

To  validate  the  resulting  model,  the  mean  square  error  in  one-step  prediction  is 
found  insufficient  to  decide  the  performance  of  a dynamic  model.  For  chaotic  time  series, 
a sample-by-sample  comparison  of  the  iterates  of  the  model  and  the  original  signal  is  inap- 
propriate due  to  the  existence  of  positive  Lyapunov  exponents.  And,  noise-like  broadband 
spectra  of  chaotic  time  series  make  the  model  validation  in  frequency-domain  difficult.  In 
this  research,  I propose  to  validate  the  dynamic  model  according  to  its  state-space  dynam- 
ical properties  as  follows:  we  iterate  the  model  to  produce  a time  series,  measure  the 
dynamical  invariants  of  the  trajectory  reconstructed  from  this  time  series,  and  compare  the 
measurements  with  those  of  the  trajectory  reconstructed  from  the  original  signal. 

Trajectory  reconstruction  is  an  important  procedure  in  this  research.  The  quality  of 
a reconstruction  will  affect  the  measurement  of  the  dynamics  invariants.  In  this  work,  I 
propose  a new  method  to  reconstruct  state  vectors.  The  elements  of  these  vectors  are  the 
Poisson  Moment  functionals  of  a given  signal,  and  they  can  be  obtained  directly  by  sam- 
pling the  outputs  of  a Poisson  Filter  Chain.  This  reconstruction  method  can  preserve  the 
dynamical  invariants  and  is  shown  to  be  robust  when  the  signal  is  noisy.  In  measuring 
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dynamical  invariants,  I choose  the  GP  algorithm  to  compute  the  correlation  dimension.  I 
suggest  a modification  to  reduce  the  knee  effect  in  the  CIM  curves.  The  estimation  of  the 
correlation  dimension  from  these  curves  will  become  more  robust  even  with  small  amount 
of  data  samples.  For  the  estimation  of  the  largest  Lyapunov  exponent,  I use  the  algorithm 
proposed  by  Wolf  et  al.,  and  I select  the  parameters  required  by  this  algorithm  according  to 
the  suggestion  given  by  Principe  and  Lo. 

As  a preprocessing  procedure  of  dynamic  modeling,  a state  space  noise  reduction 
technique  is  proposed.  A reconstructed  trajectory  from  a signal  is  projected  onto  the 
embedding  space  of  the  underlying  dynamics  to  remove  high-dimensional  noise.  To  fur- 
ther reduce  the  noise  in  the  embedding  space,  a multi-channel  MLP  with  gamma  memo- 
ries attached  to  its  input  units  is  trained  as  a state  predictor  for  the  projected  trajectory.  I 
propose  to  inspect  the  change  in  the  learning  curves  of  the  gamma  memory  parameters  to 
stop  the  training  such  that  the  neural  network  can  learn  more  about  the  signal  dynamics 
and  less  about  the  noise  dynamics.  Experimental  results  show  that  using  this  criterion  the 
training  can  be  stopped  at  the  peak  of  the  signal-to-noise  ratio  curve  or  its  vicinity.  Conse- 
quently, the  output  of  the  network  will  contain  less  noise  components.  Because  this 
method  will  not  severely  distort  a signal  in  some  prescribed  frequency  bands,  the  dynami- 
cal properties  of  the  signal  can  be  preserved  better.  Another  advantage  of  this  noise  reduc- 
tion procedure  is  that  very  little  knowledge  of  the  signal  is  required.  Most  design 
parameters,  e.g.  dimension  of  the  embedding  space,  can  be  estimated  from  the  signal 
directly.  The  noise  reduction  stage  will  be  fundamental  when  we  attempt  to  perform 
dynamic  modeling  with  experimental  data,  the  next  step  of  this  research. 

This  proposed  methodology  has  been  tested  for  some  equation-generated  signals. 
The  results  are  encouraging.  However,  the  underlying  systems  of  these  test  signals  are  of 
relatively  low  dimension.  In  other  words,  the  systems  have  just  a few  degrees  of  freedom. 
When  we  apply  this  methodology  to  model  either  the  outputs  of  more  complicated  sys- 
tems or  real-life  chaotic  signals,  some  adjustments  will  be  needed. 
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5 2 Future  Directions 

The  modeling  of  chaotic  time  series  is  still  a research  field  to  be  explored.  The 
improvement  of  modeling  techniques  will  progress  with  our  understandings  of  chaotic 
dynamics.  As  more  real-life  signals  are  identified  as  chaotic  time  series,  a robust  method- 
ology for  nonlinear  dynamic  modeling  is  in  great  demand.  A more  strict  model  validation 
procedure  is  also  highly  desirable. 

Since  the  underlying  attractor  is  not  given,  we  train  a dynamic  model  with  only  a 
single  reconstructed  trajectory.  In  model  validation,  we  also  compare  the  dynamical 
descriptors  measured  from  a finite  segment  of  the  model  output  and  the  original  signal. 
The  performace  of  the  dynamic  model  and  the  accuracy  of  the  model  validation  relies 
heavily  on  the  size  of  the  available  data  sets.  A statistical  analysis  to  establish  a scaling 
law  for  the  size  of  the  data  sets  is  definitely  of  the  highest  priority  in  the  list  of  the  future 

work. 

At  this  time,  there  are  still  some  other  open  issues  in  the  proposed  methodology. 
These  issues  include:  how  to  decide  the  number  of  the  hidden  PEs  or  the  hidden  layers  of 
a TDNNGF  in  modeling.  How  to  choose  the  parameters  required  in  the  selectable-target- 
sequence  training  method,  how  to  precisely  decide  the  value  of  X,  or  the  bandwidth  of  a 
PFC,  in  the  proposed  trajectory  reconstruction  method  such  that  the  dynamical  invariants 
can  be  well  preserved.  These  issues  need  to  be  further  studied.  In  this  work,  the  1-2  norm 
has  been  used  as  the  distance  metric  in  the  cost  function.  Some  other  f,-p  norms  may  yield 
better  results  if  we  can  work  around  the  differentiability  problem  (e.g.  using  approxima- 
tion). For  example,  the  f-oo  norm  can  yield  an  error  of  uniform  distribution,  with  which  the 
training  of  the  iterative  map  of  a dynamic  model  may  be  more  effective.  Therefore,  this 
topic  is  also  worth  pursuing  further. 
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For  real-life  applications  of  this  methodology,  modeling  EEG  (electroencephalo- 
gram) recordings  is  an  important  target  application  due  to  the  complexity  of  brain  func- 
tions and  measured  signals.  Based  on  fractal  dimension  measurements,  EEG  recordings 
have  recently  been  studied  as  chaotic  time  series  [Gal91][Rap86][Bab86][Fre87][Des88], 
Freeman  conjectured  that  the  status  of  chaotic  behavior  may  be  a stand-by  state  of  the 
brain,  which  ensures  a fast  convergence  to  any  unforseen  activity  [Fre87],  The  Computa- 
tional NeuroEngineering  Lab.  is  also  interested  in  the  nonlinear  dynamic  modeling  of 
brain  activity.  However,  it  was  felt  that  studing  EEG  recordings  just  by  measuring  correla- 
tion dimension  and  Lyapunov  exponents  was  too  crude  and  would  have  limited  impact. 
When  we  attempted  to  model  the  EEG  using  nonlinear  predictors,  the  results  were  disap- 
pointing, and  we  decided  to  go  back  to  the  study  of  the  foundations  of  nonlinear  dynamic 
modeling.  Now,  we  think  we  have  mastered  some  important  steps  such  as  the  new  recon- 
struction method  using  Poisson  Moment  Functionals  and  noise  filtering  without  affecting 
the  dynamical  invariants.  Therefore,  we  feel  that  we  now  have  more  powerful  tools  to  pur- 
sue the  initial  goal  of  modeling  brain  activity. 
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